Title: | Downstream analyses of promoter sequence architectures and HTML report generation |
---|---|
Description: | seqArchRplus facilitates downstream analyses of promoter sequence architectures/clusters identified by seqArchR (or any other tool/method). With additional available information such as the TPM values and interquantile widths (IQWs) of the CAGE tag clusters, seqArchRplus can order the input promoter clusters by their shape (IQWs), and write the cluster information as browser/IGV track files. Provided visualizations are of two kind: per sample/stage and per cluster visualizations. Those of the first kind include: plot panels for each sample showing per cluster shape, TPM and other score distributions, sequence logos, and peak annotations. The second include per cluster chromosome-wise and strand distributions, motif occurrence heatmaps and GO term enrichments. Additionally, seqArchRplus can also generate HTML reports for easy viewing and comparison of promoter architectures between samples/stages. |
Authors: | Sarvesh Nikumbh [aut, cre, cph] |
Maintainer: | Sarvesh Nikumbh <[email protected]> |
License: | GPL-3 |
Version: | 1.7.0 |
Built: | 2025-01-16 03:29:09 UTC |
Source: | https://github.com/bioc/seqArchRplus |
seqArchR result stores the clusters obtained at every iteration. It is possible that the previously chosen agglomeration and/or distance method used with hierarchical clustering does not yield reasonable clusters. This function enables minor curation of the clusters obtained from the hierarchical clustering step.
curate_clusters( sname, use_aggl = "ward.D", use_dist = "euclid", seqArchR_result, iter, pos_lab = NULL, regularize = TRUE, topn = 50, use_cutk = 2, need_change = NULL, change_to = NULL, final = FALSE, dir_path = NULL )
curate_clusters( sname, use_aggl = "ward.D", use_dist = "euclid", seqArchR_result, iter, pos_lab = NULL, regularize = TRUE, topn = 50, use_cutk = 2, need_change = NULL, change_to = NULL, final = FALSE, dir_path = NULL )
sname |
The sample name |
use_aggl |
The agglomeration method to be used for hierarchical clustering. This is passed on to 'seqArchR::collate_seqArchR_result'. See argument 'aggl_method' in collate_seqArchR_result. |
use_dist |
The distance method to be used for hierarchical clustering.
This is passed on to |
seqArchR_result |
The seqArchR result object. |
iter |
Specify which iteration of the seqArchR result should be used for obtaining clusters. |
pos_lab |
The position labels |
regularize |
Logical. Specify TRUE if the basis vector comparison
is to be regularized. Requires you to set |
topn |
Numeric. The top N features (nucleotide-position pairs)
that will be used for distance computation, rest will be ignored.
See argument 'topn' in |
use_cutk |
Value of K (number of clusters) for cutting the hierarchical clustering tree. |
need_change |
A list of elements (cluster IDs in the input clusters) that need re-assignment. Elements |
change_to |
A list of elements (cluster IDs in the input clusters)
to be assigned to those that need re-assignment. In case there is a candidate
that needs to be put into a new, independent cluster of itself, use
0 (as numeric). Both |
final |
Logical, set to TRUE or FALSE |
dir_path |
The /path/to/the/directory where files will be written. Default is NULL. |
This function helps the user work through the curation in at most three steps.
1. This function performs hierarchical clustering on the seqArchR clusters (of the specified iteration) to obtain a clustering result. The resulting clustering is visualized by a dendrogram, color-coded cluster assignments, and corresponding sequence logos. Using this visualization, the user can identify/estimate the (nearly right) number of clusters to cut the dendrogram. The **first call** uses K = 2.
2. Visually examine and count the tentative number of clusters (K) deemed right. Because these architectures are now arranged by the hierarchical clustering order, identifying this tentative value of K is much easier. Call the function this **second** time with the identified value of K. Look at the visualization now generated to determine if it is good enough, i.e., it requires only minor re-assignments.
3. Identify cases of cluster assignments that you wish to re-assign to different clusters. These can be noted as a list and supplied in a subsequent call to the function.
4. In the **final call** to the function, set 'final = TRUE', supply the re- assignments as two lists 'need_change' and 'change_to'.
More on re-assignments using arguments need_change
and
change_to
: If any element is to be put into a new cluster, use a
numeral 0 in change_to. This can be done in either scenario: when any
element is re-assigned as a singleton cluster of itself, or as clustered
with some other element coming from some other existing cluster. Consider,
for instance, among some 33 clusters identified by seqArchR, the following
re-assignments are executed.
Original_clustering <- list(c(1), c(2,3), c(4,5), c(6, 7)) need_change <- list(c(5), c(2), c(3, 6)) change_to <- list(1, 0, 0)
In the above, element 5 is re-assigned to the cluster containing element 1.
Element 2 is re-assigned to a new, singleton cluster of itself, while
elements 3 and 6 (which initially can belong to same/any two clusters) are
collated together. Note that it is important to use c()
.
Also see examples below.
This function returns a list holding (a) 'curation_plot': plot showing the dendrogram + sequence logos, (b) 'clust_assignments': the cluster (re-)assignments performed, and (c) 'clusters_list': the sequence clusters as a list.
to help perform curation and document it.
When ‘final = FALSE', the 'curation_plot’ shows the dendrogram + sequence
logos of clusters (ordered by hclust
ordering).
The 'clusters_list' holds the hclust ordered clusters.
If the 'dir_path'
is specified, a PDF file showing the same figure is also written at the
location using the default filename
'<Sample_name>_dend_arch_list_reg_top50_euclid_complete_<K>clusters.pdf'.
When ‘final = TRUE', ’clusters_list' holds the clusters with re-assignments executed, and 'curation_plot' of dendrogram + sequence logos now has an additional panel showing the sequence logos upon collation of clusters as specified by the re-assignments. Also the cluster IDs in the dendrograms have colors showing the re-assignments, i.e., elements that were re-assigned to different clusters, also have appropriate color changes reflected.
When 'dir_path' is provided, the curation plot is written to disk using the same filename as before except for suffix 'final' attched to it.
Sarvesh Nikumbh
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) seqArchR_result <- readRDS(system.file("extdata", "seqArchR_result.rds", package = "seqArchRplus", mustWork = TRUE)) ## The example seqArchR result generally holds the raw sequences, but in ## this case, the result object is devoid of them in order to reduce size of ## example data if(!("rawSeqs" %in% names(seqArchR_result))) seqArchR_result$rawSeqs <- raw_seqs use_aggl <- "complete" use_dist <- "euclid" ## get seqArchR clusters custom curated seqArchR_clusts <- seqArchRplus::curate_clusters(sname = "sample1", use_aggl = use_aggl, use_dist = use_dist, seqArchR_result = seqArchR_result, iter = 5, pos_lab = NULL, regularize = FALSE, topn = 50, use_cutk = 5, final = FALSE, dir_path = tempdir()) ## Form the lists need_change and change_to for re-assignments need_change <- list(c(2)) change_to <- list(c(1)) ## This fetches us clusters with custom/curated collation in _arbitrary_ ## order. See the next function call to order_clusters_iqw that orders ## these clusters by their median/mean IQW seqArchR_clusts <- seqArchRplus::curate_clusters(sname = "sample1", use_aggl = use_aggl, use_dist = use_dist, seqArchR_result = seqArchR_result, iter = 5, pos_lab = NULL, regularize = FALSE, topn = 50, use_cutk = 5, need_change = need_change, change_to = change_to, final = TRUE, dir_path = tempdir())
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) seqArchR_result <- readRDS(system.file("extdata", "seqArchR_result.rds", package = "seqArchRplus", mustWork = TRUE)) ## The example seqArchR result generally holds the raw sequences, but in ## this case, the result object is devoid of them in order to reduce size of ## example data if(!("rawSeqs" %in% names(seqArchR_result))) seqArchR_result$rawSeqs <- raw_seqs use_aggl <- "complete" use_dist <- "euclid" ## get seqArchR clusters custom curated seqArchR_clusts <- seqArchRplus::curate_clusters(sname = "sample1", use_aggl = use_aggl, use_dist = use_dist, seqArchR_result = seqArchR_result, iter = 5, pos_lab = NULL, regularize = FALSE, topn = 50, use_cutk = 5, final = FALSE, dir_path = tempdir()) ## Form the lists need_change and change_to for re-assignments need_change <- list(c(2)) change_to <- list(c(1)) ## This fetches us clusters with custom/curated collation in _arbitrary_ ## order. See the next function call to order_clusters_iqw that orders ## these clusters by their median/mean IQW seqArchR_clusts <- seqArchRplus::curate_clusters(sname = "sample1", use_aggl = use_aggl, use_dist = use_dist, seqArchR_result = seqArchR_result, iter = 5, pos_lab = NULL, regularize = FALSE, topn = 50, use_cutk = 5, need_change = need_change, change_to = change_to, final = TRUE, dir_path = tempdir())
For a given sample, this function forms a combined panel of three plots namely, the IQW-TPM boxplots, cluster sequence logos, and annotations per cluster. All of these individual plots can be generated using existing seqArchRplus functions
form_combined_panel(iqw_tpm_pl, seqlogos_pl, annot_pl)
form_combined_panel(iqw_tpm_pl, seqlogos_pl, annot_pl)
iqw_tpm_pl |
The IQW-TPM plot generated using
|
seqlogos_pl |
The sequence logos oneplot obtained from
|
annot_pl |
The annotations oneplot obtained from
|
This functionality requires cowplot. The combined plot panels are arranged horizontally.
This function returns a ggplot
plot object.
Sarvesh Nikumbh
Generate all plots for a given sample
generate_all_plots( sname, bed_info_fname, custom_colnames = NULL, seqArchR_clusts, raw_seqs, cager_obj = NULL, tc_gr = NULL, use_q_bound = FALSE, use_as_names = NULL, order_by_iqw = TRUE, use_median_iqw = TRUE, iqw = TRUE, tpm = TRUE, cons = FALSE, pos_lab = NULL, txdb_obj = NULL, orgdb_obj = NULL, org_name = NULL, qLow = 0.1, qUp = 0.9, tss_region = c(-500, 100), raw_seqs_mh = NULL, motifs = c("WW", "SS", "TATAA", "CG"), motif_heatmaps_flanks = c(50, 100, 200), motif_heatmaps_res = 300, motif_heatmaps_dev = "png", dir_path, txt_size = 25 )
generate_all_plots( sname, bed_info_fname, custom_colnames = NULL, seqArchR_clusts, raw_seqs, cager_obj = NULL, tc_gr = NULL, use_q_bound = FALSE, use_as_names = NULL, order_by_iqw = TRUE, use_median_iqw = TRUE, iqw = TRUE, tpm = TRUE, cons = FALSE, pos_lab = NULL, txdb_obj = NULL, orgdb_obj = NULL, org_name = NULL, qLow = 0.1, qUp = 0.9, tss_region = c(-500, 100), raw_seqs_mh = NULL, motifs = c("WW", "SS", "TATAA", "CG"), motif_heatmaps_flanks = c(50, 100, 200), motif_heatmaps_res = 300, motif_heatmaps_dev = "png", dir_path, txt_size = 25 )
sname |
The sample name |
bed_info_fname |
The BED filename with information on tag clusters. See details for expected columns (column names)/information |
custom_colnames |
Specify custom column/header names to be used with the BED file information |
seqArchR_clusts |
The seqArchR clusters' list |
raw_seqs |
The sequences corresponding to the cluster elements (also available from the seqArchR result object) |
cager_obj |
The CAGEr object. This expects that
|
tc_gr |
The tag clusters as a |
use_q_bound |
Logical. Write the lower and upper quantiles as tag cluster boundaries in BED track files with tag clusters. Default is TRUE |
use_as_names |
Specify the column name from info_df which you would like to use as names for display with the track. By default, 'use_names' is NULL, and the sequence/tag cluster IDs are used as names |
order_by_iqw |
Logical. Set TRUE to order clusters by the IQW median or mean. Set argument 'use_median_iqw' to TRUE to use the median, else will use mean if FALSE |
use_median_iqw |
Logical. Set TRUE if the median IQW for each cluster is used to order the clusters. Otherwise, the mean IQW will be used when set to FALSE |
iqw , tpm , cons
|
Logical. Specify TRUE when the corresponding plots should be included |
pos_lab |
The position labels |
txdb_obj |
The TranscriptsDB object for the organism |
orgdb_obj |
The OrgDb object for the organism |
org_name |
The organism name. This is used in the tracknames for tracks writte nas BED files |
qLow , qUp
|
Numeric values between 0 and 1. These are required when cager_obj is provided instead of the tag clusters 'tc_gr' |
tss_region |
For ChIPseeker, "region range of TSS" |
raw_seqs_mh |
Specify the sequences to be used for motif heatmaps, if they are different than the sequences clustered by seqArchr. Default is NULL, when 'raw_seqs“ are used |
motifs |
Specify a character vector of motif words (using IUPAC notation) to be visualized as a heatmap |
motif_heatmaps_flanks |
Specify a vector of different flank values to be considered for visualization. Same size flanks are considered upstream as well as downstream, hence one value suffices for eac hvisualization. When a vector 'c(50, 100, 200)' is specified, three motif heatmap files (three separate PNG files) are created, each with one flank size. The motif heatmap file will contain separate heatmaps for each of the specified motifs in the 'motifs' argument |
motif_heatmaps_res |
The resolution for the motif heatmaps. Default is 300 ppi |
motif_heatmaps_dev |
The device to be used for plotting. Default is "png". Other options available are "tiff", "pdf", and "svg" |
dir_path |
The path to the directory where files are saved |
txt_size |
The text size to be used for the plots. This is some high value because the plots are written with 'dpi=300' and are often large in size, especially the combined panel plots |
The expected columns (and column names) in the BED file are "chr", "start", "end", "width", "strand", "score", "nr_ctss", "dominant_ctss", "domTPM", "q_<qLow>", "q_<qUp>", "IQW", "tpm". Depending on the values for arguments qLow and qUp, the corresponding column names are formed. For example, if 'qLow' and 'qUp' are 0.1 and 0.9, the column names are "q_0.1" and "q_0.9". These columns are mostly present by default in the CAGEr tag clusters.
The supplied clusters are ordered by their mean/median interquantile widths before proceeding to generate the visualizations.
A list holding generated plots; some which are directly written to disk are not included in this list.
The included plots are:
- Boxplots of IQW (and TPM and conservation score when available) distributions for each cluster (as a single combined plot) - Annotation percentages per cluster as stacked barplots (as a single combined plot) - Annotation percentages per cluster as stacked barplots as a list - Sequence logos of all cluster architectures (as a single combined plot) - Sequence logos of all cluster architectures as a list - Strand-separated sequence logos of all cluster architectures as a list - Per cluster distribution of tag clusters on chromosomes and strands
In addition, the following plots are written to disk: - Visualization of all clustered sequences as a matrix - Visualization of motif occurrences (for specified motifs) in all clustered sequences
In addition, the individual clusters from seqArchR are written to disk as BED track files that can be viewed in the genome browser/IGV.
Sarvesh Nikumbh
library(GenomicRanges) library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) library(ChIPseeker) library(Biostrings) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) ## info_df <- read.delim(file = bed_fname, ## sep = "\t", header = TRUE) tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) raw_seqs_mh <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters200.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) all_plots <- generate_all_plots(sname = "sample1", bed_info_fname = bed_fname, seqArchR_clusts = use_clusts, raw_seqs = raw_seqs, tc_gr = tc_gr, use_q_bound = FALSE, order_by_iqw = FALSE, use_median_iqw = TRUE, iqw = TRUE, tpm = TRUE, cons = FALSE, pos_lab = -45:45, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, org_name = "Dmelanogaster22", qLow = 0.1, qUp = 0.9, tss_region = c(-500, 100), raw_seqs_mh = raw_seqs_mh, motifs = c("WW", "SS", "TATAA", "CG"), motif_heatmaps_flanks = c(50, 100, 200), motif_heatmaps_res = 150, motif_heatmaps_dev = "png", dir_path = tempdir(), txt_size = 25)
library(GenomicRanges) library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) library(ChIPseeker) library(Biostrings) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) ## info_df <- read.delim(file = bed_fname, ## sep = "\t", header = TRUE) tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) raw_seqs_mh <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters200.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) all_plots <- generate_all_plots(sname = "sample1", bed_info_fname = bed_fname, seqArchR_clusts = use_clusts, raw_seqs = raw_seqs, tc_gr = tc_gr, use_q_bound = FALSE, order_by_iqw = FALSE, use_median_iqw = TRUE, iqw = TRUE, tpm = TRUE, cons = FALSE, pos_lab = -45:45, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, org_name = "Dmelanogaster22", qLow = 0.1, qUp = 0.9, tss_region = c(-500, 100), raw_seqs_mh = raw_seqs_mh, motifs = c("WW", "SS", "TATAA", "CG"), motif_heatmaps_flanks = c(50, 100, 200), motif_heatmaps_res = 150, motif_heatmaps_dev = "png", dir_path = tempdir(), txt_size = 25)
This function generates an HTML report with large scrollable combined panels for multiple samples that eases comparison of changes between samples
generate_html_report( snames, file_type = "PDF", img_ht = "1200px", img_wd = "1600px", page_wd = "1800px", render_silently = TRUE, dir_path )
generate_html_report( snames, file_type = "PDF", img_ht = "1200px", img_wd = "1600px", page_wd = "1800px", render_silently = TRUE, dir_path )
snames |
Sample names to be included |
file_type |
"PDF" (default) or "SVG" (to be supported in the future). The type of files to look for in the sample-specific results folder |
img_ht , img_wd
|
The height and width (in pixels) of the images in the output HTML report. Default values are '1200px' (height) and '1600px' (width) |
page_wd |
The width of the body in the HTML. Default is '1800px'. |
render_silently |
Logical. TRUE or FALSE |
dir_path |
Specify the '/path/to/directory' where sample-specific results folders are located. This is a required argument and cannot be NULL. A directory named 'combined_results' is created at the given location, and the HTML report is written into it |
This functionality requires suggested libraries
`slickR`
and `pdftools`
installed.
The function assumes requires that the candidate figure files have combined_panel
Note that the combined plot panels are arranged horizontally and
therefore are best viewed in wide desktop monitors.
Nothing. Report is written to disk at the provided `dir_path`
using the filename
'Combined_panels_report_samples_<samples_names>.html'
.
Sarvesh Nikumbh
## Need these packages to run these examples if(require("slickR", "pdftools")){ ## Make IQW-TPM plots bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "dominant_ctss", "domTPM", "strand", "score", "nr_ctss", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) use_dir <- tempdir() iqw_tpm_pl <- iqw_tpm_plots(sname = "sample1", dir_path = use_dir, info_df = info_df, iqw = TRUE, tpm = TRUE, cons = FALSE, clusts = use_clusts, txt_size = 14) ## Make sequence logos library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) seqlogo_oneplot_pl <- per_cluster_seqlogos(sname = "sample1", seqs = raw_seqs, clusts = use_clusts, pos_lab = -45:45, bits_yax = "auto", strand_sep = FALSE, one_plot = TRUE, dir_path = use_dir, txt_size = 14) ## Need the TxDb object to run these examples annotations_pl <- NULL if(require("TxDb.Dmelanogaster.UCSC.dm6.ensGene")){ annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = NULL, tc_gr = bed_fname, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = use_dir, tss_region = c(-500,100)) } ## Combine them together if(!is.null(annotations_pl)){ panel_pl <- form_combined_panel(iqw_tpm_pl = iqw_tpm_pl, seqlogos_pl = seqlogos_oneplot_pl, annot_pl = annotations_oneplot_pl) }else{ panel_pl <- form_combined_panel(iqw_tpm_pl = iqw_tpm_pl, seqlogos_pl = seqlogos_oneplot_pl) } cowplot::save_plot(filename = file.path(use_dir, paste0("sample1_combined_panel.pdf")), plot = panel_pl) ## Call function to generate HTML report generate_html_report(snames = c("sample1", "sample1"), dir_path = use_dir) }
## Need these packages to run these examples if(require("slickR", "pdftools")){ ## Make IQW-TPM plots bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "dominant_ctss", "domTPM", "strand", "score", "nr_ctss", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) use_dir <- tempdir() iqw_tpm_pl <- iqw_tpm_plots(sname = "sample1", dir_path = use_dir, info_df = info_df, iqw = TRUE, tpm = TRUE, cons = FALSE, clusts = use_clusts, txt_size = 14) ## Make sequence logos library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) seqlogo_oneplot_pl <- per_cluster_seqlogos(sname = "sample1", seqs = raw_seqs, clusts = use_clusts, pos_lab = -45:45, bits_yax = "auto", strand_sep = FALSE, one_plot = TRUE, dir_path = use_dir, txt_size = 14) ## Need the TxDb object to run these examples annotations_pl <- NULL if(require("TxDb.Dmelanogaster.UCSC.dm6.ensGene")){ annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = NULL, tc_gr = bed_fname, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = use_dir, tss_region = c(-500,100)) } ## Combine them together if(!is.null(annotations_pl)){ panel_pl <- form_combined_panel(iqw_tpm_pl = iqw_tpm_pl, seqlogos_pl = seqlogos_oneplot_pl, annot_pl = annotations_oneplot_pl) }else{ panel_pl <- form_combined_panel(iqw_tpm_pl = iqw_tpm_pl, seqlogos_pl = seqlogos_oneplot_pl) } cowplot::save_plot(filename = file.path(use_dir, paste0("sample1_combined_panel.pdf")), plot = panel_pl) ## Call function to generate HTML report generate_html_report(snames = c("sample1", "sample1"), dir_path = use_dir) }
Handle writing of tag clusters obtained from a CAGE experiment to disk as BED files. It can also return corresponding promoter sequences as a DNAStringSet object or write them to disk as FASTA files.
For information on tag clusters obtained by clustering of CAGE-derived TSS, we refer the reader to the CAGEr vignette, Section 3.7 https://www.bioconductor.org/packages/release/bioc/vignettes/CAGEr/inst/doc/CAGEexp.html#ctss-clustering
handle_tc_from_cage( sname, tc_gr, cager_obj, qLow = 0.1, qUp = 0.9, fl_size_up = 500, fl_size_down = 500, bsgenome, dir_path = NULL, fname_prefix = NULL, fname_suffix = NULL, write_to_disk = TRUE, ret_seqs = TRUE )
handle_tc_from_cage( sname, tc_gr, cager_obj, qLow = 0.1, qUp = 0.9, fl_size_up = 500, fl_size_down = 500, bsgenome, dir_path = NULL, fname_prefix = NULL, fname_suffix = NULL, write_to_disk = TRUE, ret_seqs = TRUE )
sname |
The sample name |
tc_gr |
Tag clusters as |
cager_obj |
A CAGEexp object obtained from the CAGEr package, if and when CAGEr was used to process the raw CAGE data |
qLow , qUp
|
The interquantile boundaries to be considered for obtaining
tag clusters from the CAGEexp object. See |
fl_size_up , fl_size_down
|
Numeric. The size of the flanks in the upstream and downstream directions. |
bsgenome |
The BSgenome file that will be used to obtain sequences of the organism from. |
dir_path |
The path to the directory where files will be written. By default, all BED files are written within a subdirectory named "BED", and all FASTA files are written within a subdirectory named "FASTA", both created at the 'dir_path' location. |
fname_prefix , fname_suffix
|
Specify any prefix or suffix string to be used in the filename. This can be the organism name etc. Specify without any separator. By default, an underscore is used as a separator in the filename. |
write_to_disk |
Logical. Specify TRUE to write files to disk. More specifically, BED files are written to disk only when this is set to TRUE. For promoter sequences, FASTA files are written to disk if this arg is set to TRUE, otherwise not. and a DNAStringSet object is returned if 'ret_seqs' is set to TRUE. |
ret_seqs |
Logical. Specify TRUE if promoter sequences are to be returned as a DNAStringSet object. |
You can use the fname_prefix and fname_suffix arguments to specify strings to be used as prefix and suffix for the files. For example, the organism name can be used as a prefix for the filename. Similarly, for suffix.
If 'ret_seqs' is TRUE, a DNAStringSet object is returned. Depending on 'write_to_disk', files are written to disk at the specified location.
If 'ret_seq = TRUE', the promoter sequences are returned as a
DNAStringSet
object.
Sarvesh Nikumbh
if(require("BSgenome.Dmelanogaster.UCSC.dm6")){ tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) seqs <- seqArchRplus::handle_tc_from_cage(sname = "sample1", tc_gr = tc_gr, fl_size_up = 500, fl_size_down = 500, dir_path = NULL, fname_prefix = "pre", fname_suffix = "suff", write_to_disk = FALSE, bsgenome = BSgenome.Dmelanogaster.UCSC.dm6, ret_seqs = TRUE) }
if(require("BSgenome.Dmelanogaster.UCSC.dm6")){ tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) seqs <- seqArchRplus::handle_tc_from_cage(sname = "sample1", tc_gr = tc_gr, fl_size_up = 500, fl_size_down = 500, dir_path = NULL, fname_prefix = "pre", fname_suffix = "suff", write_to_disk = FALSE, bsgenome = BSgenome.Dmelanogaster.UCSC.dm6, ret_seqs = TRUE) }
IQW, TPM plots
iqw_tpm_plots( sname, info_df, clusts, iqw = TRUE, tpm = TRUE, cons = TRUE, order_by_median = TRUE, dir_path = NULL, txt_size = 12, use_suffix = NULL, use_prefix = "C", wTitle = TRUE )
iqw_tpm_plots( sname, info_df, clusts, iqw = TRUE, tpm = TRUE, cons = TRUE, order_by_median = TRUE, dir_path = NULL, txt_size = 12, use_suffix = NULL, use_prefix = "C", wTitle = TRUE )
sname |
Sample name |
info_df |
A DataFrame object holding information on the tag clusters. Expected columns (names) are 'chr', 'start', 'end', 'IQW', 'domTPM', and 'strand'. |
clusts |
List of sequence ids in each cluster. |
iqw |
Logical. Specify TRUE if boxplots of interquantile widths (IQW) of the tag clusters corresponding to the promoters in each cluster are to be plotted. |
tpm |
Logical. Specify TRUE if boxplots of TPM values for all clusters are to be plotted. |
cons |
Logical. Specify TRUE if boxplots of conservation scores (PhastCons scores) for all clusters. If this is TRUE, an additional column named 'cons' is expected in the 'info_df'. |
order_by_median |
Logical. Whether to order to clusters by their median (when TRUE) or mean (when FALSE) interquantile widths. |
dir_path |
The /path/to/the/directory where files will be written. Default is NULL. |
txt_size |
Specify text size to be used in the plots. |
use_suffix , use_prefix
|
Character. Specify any suffix and/or prefix you wish to add to the filename. |
wTitle |
Logical. If TRUE, the returned plot will contain a default title, which is the same as the filename. See details. |
The plots are written to a file named "Sample_<sample_name>_IQW_TPM_Cons_plot.pdf" if all of 'iqw', 'tpm', and 'cons' are set to TRUE. This is also set as the plot title if 'wTitle' is set to TRUE.
All plots are arranged by the IQWs (smallest on top, largest at the bottom), even if 'iqw' is set to FALSE.
The plot(s) as a ggplot2 object. The order of the plots is IQW, followed by TPM values (when specified), followed by conservation scores (when specified).
Sarvesh Nikumbh
bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "dominant_ctss", "domTPM", "strand", "score", "nr_ctss", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) iqw_tpm_pl <- iqw_tpm_plots(sname = "sample1", dir_path = tempdir(), info_df = info_df, iqw = TRUE, tpm = TRUE, cons = FALSE, clusts = use_clusts, txt_size = 14)
bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "dominant_ctss", "domTPM", "strand", "score", "nr_ctss", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) iqw_tpm_pl <- iqw_tpm_plots(sname = "sample1", dir_path = tempdir(), info_df = info_df, iqw = TRUE, tpm = TRUE, cons = FALSE, clusts = use_clusts, txt_size = 14)
Order clusters by median or mean interquantile widths
order_clusters_iqw(sname, clusts, info_df, order_by_median = TRUE)
order_clusters_iqw(sname, clusts, info_df, order_by_median = TRUE)
sname |
The sample name |
clusts |
List of sequence ids in each cluster. |
info_df |
The data.frame with all tag clusters information. The following columns are expected in the data.frame:"chr", "start", "end", "width", "strand", "score", "nr_ctss", "dominant_ctss", "domTPM", "IQW", "tpm" and two additional columns based on qLow and qUp used. |
order_by_median |
Logical. Whether to order to clusters by their median (when TRUE) or mean (when FALSE) interquantile widths. |
The list of clusters ordered by their mean/median interquantile widths (shortest first).
Sarvesh Nikumbh
bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "dominant_ctss", "domTPM", "strand", "score", "nr_ctss", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) ordered_clusts <- seqArchRplus::order_clusters_iqw( sname = "sample1", clusts = use_clusts, info_df = info_df, order_by_median = TRUE)
bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "dominant_ctss", "domTPM", "strand", "score", "nr_ctss", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) ordered_clusts <- seqArchRplus::order_clusters_iqw( sname = "sample1", clusts = use_clusts, info_df = info_df, order_by_median = TRUE)
This function helps annotate the genomic regions specified in 'tc_gr' with features, namely, promoter-TSS (transcription start site), exons, 5'UTR, 3'UTR, introns and (distal) intergenic regions. This requires that the annotations are available as a TxDb object. The selected genomic regions can be specified as a single GenomicRanges object. These regions can be specified directly as a BED file (when available) or select specific regions from a larger set of regions based on some clustering.
When working with CAGE data, if the CAGEr package was used and the corresponding CAGEexp object is available, it can also be used – see 'cager_obj' argument.
per_cluster_annotations( sname = NULL, clusts = NULL, tc_gr = NULL, cager_obj = NULL, qLow = 0.1, qUp = 0.9, txdb_obj = NULL, tss_region = NULL, orgdb_obj = NULL, one_plot = TRUE, dir_path = NULL, txt_size = 12, use_suffix = NULL, use_prefix = "C", n_cores = 1 )
per_cluster_annotations( sname = NULL, clusts = NULL, tc_gr = NULL, cager_obj = NULL, qLow = 0.1, qUp = 0.9, txdb_obj = NULL, tss_region = NULL, orgdb_obj = NULL, one_plot = TRUE, dir_path = NULL, txt_size = 12, use_suffix = NULL, use_prefix = "C", n_cores = 1 )
sname |
Sample name. Default is NULL. This is a required argument if the CAGEexp object is provided. See 'cager_obj' argument |
clusts |
List of sequence IDs in each cluster. This can be NULL only when a BED file is passed to the argument 'tc_gr' |
tc_gr |
Tag clusters as |
cager_obj |
A CAGEexp object obtained from the CAGEr package, if and when CAGEr was used to process the raw CAGE data |
qLow , qUp
|
The interquantile boundaries to be considered for obtaining
tag clusters from the CAGEexp object. See |
txdb_obj |
A TxDb object storing transcript metadata |
tss_region |
For ChIPseeker |
orgdb_obj |
Organism-level annotation package |
one_plot |
Logical. Default is TRUE. If set to FALSE the barplots of annotations per cluster are returned as a list, else all are condensed into single plot |
dir_path |
Specify the /path/to/directory to store results |
txt_size |
Adjust text size for the plots |
use_suffix , use_prefix
|
Character. Specify any suffix and/or prefix you wish to add to the cluster labels |
n_cores |
Numeric. If you wish to parallelize annotation of peaks, specify the number of cores. Default is 1 (serial) |
When annotations for only selected clusters are required, alter the 'clusts' argument to specify only those selected clusters. Because the 'clusts' list holds the IDs of sequences belonging to each cluster, the corresponding records are selected from the 'tc_gr' GRanges object. This approach requires that sequence IDs in 'clusts' are directly associated with the ranges in 'tc_gr'. Also, see examples.
When 'one_plot = TRUE', a single plot where annotation barplots for each cluster are put together (ordered as per the clusters in 'clusts'). Otherwise, a list of annotation barplots is returned (again ordered by the clusters in 'clusts').
Sarvesh Nikumbh
## Need the TxDb object to run these examples if(require("TxDb.Dmelanogaster.UCSC.dm6.ensGene")){ library(GenomicRanges) library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) library(ChIPseeker) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE) tc_gr_from_df <- GenomicRanges::makeGRangesFromDataFrame(info_df, keep.extra.columns = TRUE) tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) tdir <- tempdir() # Get annotations for all clusters in use_clusts annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = use_clusts, tc_gr = tc_gr_from_df, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = tdir, tss_region = c(-500,100)) # Get annotations for selected clusters in use_clusts # -- First two clusters selected_clusts <- lapply(seq(2), function(x) use_clusts[[x]]) # OR # -- Mixed set of clusters, say 1 and 3 out of total 3 selected_clusts <- lapply(c(1,3), function(x) use_clusts[[x]]) # annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = selected_clusts, tc_gr = tc_gr, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = tdir, tss_region = c(-500,100)) # Alternatively, you can also directly specify a BED file to the `tc_gr` # argument. This is useful when one may not have access to the CAGEexp # object, but only clusters' information is available in a BED file. # annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = NULL, tc_gr = bed_fname, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = tdir, tss_region = c(-500,100)) }
## Need the TxDb object to run these examples if(require("TxDb.Dmelanogaster.UCSC.dm6.ensGene")){ library(GenomicRanges) library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) library(ChIPseeker) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE) tc_gr_from_df <- GenomicRanges::makeGRangesFromDataFrame(info_df, keep.extra.columns = TRUE) tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) tdir <- tempdir() # Get annotations for all clusters in use_clusts annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = use_clusts, tc_gr = tc_gr_from_df, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = tdir, tss_region = c(-500,100)) # Get annotations for selected clusters in use_clusts # -- First two clusters selected_clusts <- lapply(seq(2), function(x) use_clusts[[x]]) # OR # -- Mixed set of clusters, say 1 and 3 out of total 3 selected_clusts <- lapply(c(1,3), function(x) use_clusts[[x]]) # annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = selected_clusts, tc_gr = tc_gr, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = tdir, tss_region = c(-500,100)) # Alternatively, you can also directly specify a BED file to the `tc_gr` # argument. This is useful when one may not have access to the CAGEexp # object, but only clusters' information is available in a BED file. # annotations_pl <- per_cluster_annotations(sname = "sample1", clusts = NULL, tc_gr = bed_fname, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, one_plot = FALSE, dir_path = tdir, tss_region = c(-500,100)) }
This function helps identify GO terms enriched per cluster. This requires that the annotations are available as a TxDb object. The selected genomic regions can be specified as a single GenomicRanges object. These regions can be specified directly as a BED file (when available) or select specific regions from a larger set of regions based on some clustering.
per_cluster_go_term_enrichments( sname = NULL, clusts = NULL, tc_gr = NULL, cager_obj = NULL, qLow = 0.1, qUp = 0.9, txdb_obj = NULL, tss_region = NULL, orgdb_obj = NULL, use_keytype = "ENTREZID", one_file = TRUE, bar_or_dot = "dot", dir_path = NULL, txt_size = 12, n_cores = 1 )
per_cluster_go_term_enrichments( sname = NULL, clusts = NULL, tc_gr = NULL, cager_obj = NULL, qLow = 0.1, qUp = 0.9, txdb_obj = NULL, tss_region = NULL, orgdb_obj = NULL, use_keytype = "ENTREZID", one_file = TRUE, bar_or_dot = "dot", dir_path = NULL, txt_size = 12, n_cores = 1 )
sname |
Sample name. Default is NULL. This is a required argument if the CAGEexp object is provided. See 'cager_obj' argument |
clusts |
List of sequence IDs in each cluster. This can be NULL only when a BED file is passed to the argument 'tc_gr' |
tc_gr |
Tag clusters as |
cager_obj |
A CAGEexp object obtained from the CAGEr package, if and when CAGEr was used to process the raw CAGE data |
qLow , qUp
|
The interquantile boundaries to be considered for obtaining
tag clusters from the CAGEexp object. See |
txdb_obj |
A TxDb object storing transcript metadata |
tss_region |
For ChIPseeker |
orgdb_obj |
Organism-level annotation package |
use_keytype |
Either of "ENTREZID" or "ENSEMBL". Required for use with
|
one_file |
Logical. Default is TRUE. If set to FALSE the plots of GO terms enriched per cluster are returned as a list, else all are written to a single file as separate pages |
bar_or_dot |
Specify "dot" for dotplot (default), or "bar" for barplot |
dir_path |
Specify the /path/to/directory to store results |
txt_size |
Adjust text size for the plots |
n_cores |
For future use |
Both 'txdb_obj' and 'orgdb_obj' are required.
Per cluster, the enriched GO terms are visualized as a dot plot which shows the enriched terms on the vertical axis and the ratio of genes that are enriched for the given GO term vs. the total genes in the cluster.
The list of dot plots showing GO term enrichments per cluster. This is a list of ggplot2 plots.
When 'one_file' is set to TRUE (default), in addition to returning the list of dot plots, these plots are also written to disk as a PDF, with one plot per page.
Sarvesh Nikumbh
library(GenomicRanges) library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) library(ChIPseeker) ## important to load this package library(org.Dm.eg.db) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE) tc_gr_from_df <- GenomicRanges::makeGRangesFromDataFrame(info_df, keep.extra.columns = TRUE) tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) tdir <- tempdir() # Get GO term enrichments for all clusters in use_clusts go_pl <- per_cluster_go_term_enrichments(sname = "sample1", clusts = use_clusts[1:2], tc_gr = tc_gr_from_df, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, dir_path = tdir, one_file = FALSE, tss_region = c(-500,100), orgdb_obj = "org.Dm.eg.db")
library(GenomicRanges) library(TxDb.Dmelanogaster.UCSC.dm6.ensGene) library(ChIPseeker) ## important to load this package library(org.Dm.eg.db) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE) tc_gr_from_df <- GenomicRanges::makeGRangesFromDataFrame(info_df, keep.extra.columns = TRUE) tc_gr <- readRDS(system.file("extdata", "example_tc_gr.rds", package = "seqArchRplus", mustWork = TRUE)) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) tdir <- tempdir() # Get GO term enrichments for all clusters in use_clusts go_pl <- per_cluster_go_term_enrichments(sname = "sample1", clusts = use_clusts[1:2], tc_gr = tc_gr_from_df, txdb_obj = TxDb.Dmelanogaster.UCSC.dm6.ensGene, dir_path = tdir, one_file = FALSE, tss_region = c(-500,100), orgdb_obj = "org.Dm.eg.db")
Plot per cluster sequence logos
per_cluster_seqlogos( sname, seqs = NULL, clusts, pos_lab = NULL, bits_yax = "max", strand_sep = FALSE, one_plot = TRUE, info_df = NULL, txt_size = 12, save_png = FALSE, dir_path = NULL )
per_cluster_seqlogos( sname, seqs = NULL, clusts, pos_lab = NULL, bits_yax = "max", strand_sep = FALSE, one_plot = TRUE, info_df = NULL, txt_size = 12, save_png = FALSE, dir_path = NULL )
sname |
The sample name |
seqs |
The raw sequences as a |
clusts |
List of sequence ids in each cluster. |
pos_lab |
The position labels |
bits_yax |
The yaxis limits. Possible values are "full", and "auto".
See argument in |
strand_sep |
Logical. Whether sequences are to be separated by strand. |
one_plot |
Logical. Whether all sequence logos should be combined into one grid (with ncol = 1)? |
info_df |
The information data.frame |
txt_size |
The font size for text |
save_png |
Logical. Set TRUE if you would like to save the architectures sequence logos as PNG files |
dir_path |
The /path/to/the/directory where plot will be saved. Default is NULL |
Plots the sequence logos of all clusters
If 'one_plot' is TRUE, one plot as a grid of all sequence logos is returned.
If 'one_plot' is FALSE, a set of ggplot2-based sequence logos for all clusters is saved to disk with the default filename 'Architectures_0-max.pdf' in the location provided by 'dir_path'. This is a multi-page PDF document with the sequence logo for each cluster on a separate page. Also, the list of plots is returned.
With 'strand_sep = TRUE', the 'one_plot' option is not available. In other words, only a list of plots is returned. When 'dir_path' is not NULL, the default filename used is 'Architectures_0-max_strand_separated.pdf'.
Sarvesh Nikumbh
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) seqlogo_pl <- per_cluster_seqlogos(sname = "sample1", seqs = raw_seqs, clusts = use_clusts, pos_lab = -45:45, bits_yax = "auto", strand_sep = FALSE, one_plot = TRUE, dir_path = tempdir(), txt_size = 14)
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) seqlogo_pl <- per_cluster_seqlogos(sname = "sample1", seqs = raw_seqs, clusts = use_clusts, pos_lab = -45:45, bits_yax = "auto", strand_sep = FALSE, one_plot = TRUE, dir_path = tempdir(), txt_size = 14)
Visualize as barplots how promoters in each cluster are distributed on different chromosomes and strands
per_cluster_strand_dist( sname, clusts, info_df, dir_path = NULL, colrs = c("#FB8072", "#80B1D3"), txt_size = 14, fwidth = 20, fheight = 3 )
per_cluster_strand_dist( sname, clusts, info_df, dir_path = NULL, colrs = c("#FB8072", "#80B1D3"), txt_size = 14, fwidth = 20, fheight = 3 )
sname |
The sample name |
clusts |
List of sequence ids in each cluster. |
info_df |
The information data.frame |
dir_path |
Specify the path to the directory on disk where plots will be saved |
colrs |
Specify colors used for two strands. By default, the fourth and fifth color from the palette 'Set3' are used |
txt_size |
Size of the text in the plots (includes plot title, y-axis title, axis texts, legend title and text) |
fwidth |
Width of the individual plots in file |
fheight |
Height of the plots in file |
A list of plots showing the per cluster division of promoters on chromosomes and strands. These plots are also written to disk in file named "Per_cluster_strand_distributions.pdf"
Sarvesh Nikumbh
library(RColorBrewer) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "strand", "score", "nr_ctss", "dominant_ctss", "domTPM", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) pair_colrs <- RColorBrewer::brewer.pal(n = 5, name = "Set3")[4:5] per_cl_strand_pl <- per_cluster_strand_dist(sname = "sample1", clusts = use_clusts, info_df = info_df, dir_path = tempdir(), colrs = pair_colrs )
library(RColorBrewer) bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE, col.names = c("chr", "start", "end", "width", "strand", "score", "nr_ctss", "dominant_ctss", "domTPM", "q_0.1", "q_0.9", "IQW", "tpm")) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) pair_colrs <- RColorBrewer::brewer.pal(n = 5, name = "Set3")[4:5] per_cl_strand_pl <- per_cluster_strand_dist(sname = "sample1", clusts = use_clusts, info_df = info_df, dir_path = tempdir(), colrs = pair_colrs )
Plot heatmaps of motifs occurring in seqArchR clusters
plot_motif_heatmaps( sname, seqs, flanks = c(50), clusts, use_colors = NULL, motifs, dir_path, fheight = 500, fwidth = 500, funits = "px", n_cores = 1, res = 150, dev = "png" )
plot_motif_heatmaps( sname, seqs, flanks = c(50), clusts, use_colors = NULL, motifs, dir_path, fheight = 500, fwidth = 500, funits = "px", n_cores = 1, res = 150, dev = "png" )
sname |
The sample name |
seqs |
The sequences as a DNAStringSet object |
flanks |
Flank size. The same flank is used upstream and downstream. A vector of values is also accepted when more than one flanks should be visualized. |
clusts |
List of sequence Ids in each cluster. |
use_colors |
Specify colors to use |
motifs |
A vector of motifs to be visualized in the sequence. This can be any words formed by the IUPAC code. For example, TATAA, CG, WW, SS etc. |
dir_path |
The path to the directory |
fheight , fwidth , funits
|
Height and width of the image file, and the units in which they are specified. Units are inches for PDF and SVG, and pixels for PNG and TIFF devices |
n_cores |
Numeric. If you wish to parallelize, specify the number of cores. Default is 1 (serial) |
res |
Numeric. The resolution in ppi to be used for producing PNG or TIFF files. Default is 150 |
dev |
Specify the device type. One of 'png', 'tiff', 'pdf', or 'svg'. Default is 'png' |
Nothing. Images are written to disk using the provided filenames.
Sarvesh Nikumbh
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters200.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) plot_motif_heatmaps(sname = "sample1", seqs = raw_seqs, flanks = c(10, 20, 50), clusts = use_clusts, motifs = c("WW", "SS", "TATAA", "CG", "Y"), dir_path = tempdir(), fheight = 800, fwidth = 2400)
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters200.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) plot_motif_heatmaps(sname = "sample1", seqs = raw_seqs, flanks = c(10, 20, 50), clusts = use_clusts, motifs = c("WW", "SS", "TATAA", "CG", "Y"), dir_path = tempdir(), fheight = 800, fwidth = 2400)
This function uses the seqPattern package. It is recommended to use this function rather than ‘plot_motif_heatmaps’
plot_motif_heatmaps2( sname, seqs, flanks = c(50), clusts, use_colors = NULL, motifs, dir_path, fheight = 1.5 * fwidth, fwidth = 2000, hm_scale_factor = 0.75, n_cores = 1, type = c("png", "jpg"), ... )
plot_motif_heatmaps2( sname, seqs, flanks = c(50), clusts, use_colors = NULL, motifs, dir_path, fheight = 1.5 * fwidth, fwidth = 2000, hm_scale_factor = 0.75, n_cores = 1, type = c("png", "jpg"), ... )
sname |
The sample name |
seqs |
The sequences as a DNAStringSet object |
flanks |
Flank size. The same flank is used upstream and downstream. A vector of values is also accepted when more than oone flanks should be visualized |
clusts |
List of sequence Ids in each cluster |
use_colors |
Specify colors to use |
motifs |
A vector of motifs to be visualized in the sequence. This can be any words formed by the IUPAC code. For example, TATAA, CG, WW, SS etc. |
dir_path |
The path to the directory |
fheight , fwidth
|
Height and width of the individual heatmap plots in pixels |
hm_scale_factor |
Factor to scale the color scale bar w.r.t. the heatmaps. Values in (0,1]. Useful when specifying more than two motifs at once. Note that combining/specifying more than 3 or 4 motifs in one call to the function may result in a sub-optimally combined plot |
n_cores |
Numeric. If you wish to parallelize, specify the number of cores. Default is 1 (serial) |
type |
Specify either of "png" or "jpg" to obtain PNG or JPEG files as output |
... |
Additional arguments passed to
|
Nothing. Images are written to disk using the provided filenames. In addition, two legends are printed to separate files: the color legend and the clustering legend, which can be then combined with the heatmaps. The heatmaps themselves have the cluster numbers marked on the vertical axis.
Sarvesh Nikumbh
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters200.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) plot_motif_heatmaps2(sname = "sample1", seqs = raw_seqs, flanks = c(50), clusts = use_clusts, motifs = c("WW", "SS"), dir_path = tempdir())
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters200.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) plot_motif_heatmaps2(sname = "sample1", seqs = raw_seqs, flanks = c(50), clusts = use_clusts, motifs = c("WW", "SS"), dir_path = tempdir())
This package facilitates various downstream analyses and visualizations for clusters/architectures identified in promoter sequences. If a seqArchR result object is available, and in addition, the CAGEr object or information on the tag clusters, this package can be used for the following:
- Order the sequence architectures by the interquantile widths (IQWs) of the
tag clusters they originate from. See 'CAGEr' vignette for more information.
order_clusters_iqw
- Visualize distributions of IQW, TPM and conservation scores per cluster
iqw_tpm_plots
- Visualize the percentage annotations of promoter sequence per cluster
per_cluster_annotations
- Visualize heatmaps of occurrences of motifs (as words) in sequences
plot_motif_heatmaps
and plot_motif_heatmaps2
- Visualize the above plots as (publication ready) combined panels viewable for different samples as HTML reports
- Following per cluster visualizations:
- sequence logos of architectures (including strand-separated ones)
per_cluster_seqlogos
- distributions of promoters on different chromosomes/strands
per_cluster_strand_dist
- GO term enrichments
per_cluster_go_term_enrichments
- Produce BED track files of seqArchR clusters for visulization in a genome
browser or IGV
write_seqArchR_cluster_track_bed
- Curate seqArchR clusters
curate_clusters
- (future) Generate HTML reports that help you navigate this wealth of information with ease, and enable insights and hypotheses generation
Sarvesh Nikumbh
Visualize all sequences as an image
seqs_acgt_image( sname, seqs, seqs_ord, pos_lab, xt_freq = 5, yt_freq = 500, f_height = 1200, f_width = 600, dir_path = NULL )
seqs_acgt_image( sname, seqs, seqs_ord, pos_lab, xt_freq = 5, yt_freq = 500, f_height = 1200, f_width = 600, dir_path = NULL )
sname |
The sample name |
seqs |
The sequences |
seqs_ord |
The order of sequences |
pos_lab |
The position labels |
xt_freq |
The frequency of xticks |
yt_freq |
The frequency of yticks |
f_height , f_width
|
The height and width for the PNG image. |
dir_path |
Specify the /path/to/directory to store results. Default is NULL. |
Nothing. PNG images are written to disk at the provided 'dir_path' using the filename '<sample_name>_ClusteringImage.png'.
Sarvesh Nikumbh
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) seqs_acgt_image(sname = "sample1", seqs = raw_seqs, seqs_ord = unlist(use_clusts), pos_lab = -45:45, dir_path = tempdir())
library(Biostrings) raw_seqs <- Biostrings::readDNAStringSet( filepath = system.file("extdata", "example_promoters45.fa.gz", package = "seqArchRplus", mustWork = TRUE) ) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) seqs_acgt_image(sname = "sample1", seqs = raw_seqs, seqs_ord = unlist(use_clusts), pos_lab = -45:45, dir_path = tempdir())
Writes the seqArchR clusters as BED tracks for viewing in IGV or any genome browser
write_seqArchR_cluster_track_bed( sname, clusts = NULL, info_df, use_q_bound = TRUE, use_as_names = NULL, one_zip_all = FALSE, org_name = NULL, dir_path, include_in_report = FALSE, strand_sep = FALSE )
write_seqArchR_cluster_track_bed( sname, clusts = NULL, info_df, use_q_bound = TRUE, use_as_names = NULL, one_zip_all = FALSE, org_name = NULL, dir_path, include_in_report = FALSE, strand_sep = FALSE )
sname |
Sample name |
clusts |
List of sequence ids in each cluster |
info_df |
The data.frame holding information to be written to the BED file. Expected columns are "chr", "start", "end", "names", "strand", "domTPM", and "dominant_ctss". Out of these, only "dominant_ctss" column is optional; when present this is visualised as thickStart and thickEnd. The "domTPM" column is treated as scores for visualizing by default. One can also visualize PhastCons score |
use_q_bound |
Logical. Write the lower and upper quantiles as tag cluster boundaries. Default is TRUE |
use_as_names |
Specify the column name from info_df which you would like to use as names for display with the track. This is NULL by default, and the sequence/tag cluster IDs are used as names. |
one_zip_all |
Logical. Specify TRUE when the facility to download BED files for all clusters with one click is desired, otherwise FALSE. This is relevant only when include_in_report is TRUE. Default is FALSE |
org_name |
Organism name |
dir_path |
The /path/to/the/directory where a special folder named 'Cluster_BED_tracks' (by default) is created and all BED files are written inside it. This is a required argument, and cannot be NULL |
include_in_report |
Logical. Specify TRUE when this function is invoked to write BED files to disk *and* provide downloadable links in the HTML report. The corresponding chunk in the Rmarkdown report should set the parameter ‘result=’asis''. By setting this to FALSE, BED files are written to disk, but no downloadable links are provided. Note: This should be TRUE, when 'one_zip_all' argument is set to TRUE (see below). Requires the package 'xfun' |
strand_sep |
Logical. Specify TRUE if records for each strand are to be written in separate BED files |
Note on links in HTML: For providing downloadable links in the HTML report, the complete BED files are encoded into base64 strings and embedded with the HTML report itself. This considerably increases the size of the HTML file, and can slow down loading of the HTML file in your browser.
Note on BED files:
The output BED files have selected columns provided in the 'info_df'.
These are "chr", "start", "end", "name", "score" (see more info below),
"strand", "dominant_ctss".
By default, the sequence/tag cluster IDs are used as names.
If `use_as_names`
is specified, information from that column in the
'info_df' is used as "name".
If conservation score (e.g., PhastCons) is available, it is used as the score, otherwise the TPM value of the dominant CTSS (domTPM) is used. The final two columns (when dominantCTSS column is present), are the 'thickStart' and ‘thickEnd’ values corresponding to the BED format. The 'thickEnd' column is the dominant_ctss position.
Importantly, the lower and upper quantile boundaries are used as the start and end coordinates of the cluster when 'use_q_bound' is set to TRUE (the default).
When `include_in_report = FALSE`
, the cluster information is written
to disk as BED track files that can be viewed in the genome browser or IGV.
Otherwise, a str object holding HTML text is returned that can be included
in the report as downloadable links for each cluster BED file (use
`cat`
function).
When `one_zip_all = TRUE`
, a link to download all files zipped into
one is also provided to enable convenience.
Sarvesh Nikumbh
bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) # Write seqArchR clusters of promoters/CAGE tag clusters as BED track files # All possible variations are enlisted here # Using quantiles information as the tag cluster boundaries # via arg `use_q_bound = TRUE` and using custom names for each. # Create a new/custom column, and specify the new column name for argument # `use_as_names`. Notice that any custom names can be obtained by this # approach. info_df$use_names <- paste(rownames(info_df), info_df$domTPM, sep = "_") write_seqArchR_cluster_track_bed(sname = "sample1", clusts = use_clusts, info_df = info_df, use_q_bound = FALSE, use_as_names = "use_names", dir_path = tempdir() ) # Generating textual output that can be included in HTML reports. # This requires package xfun. cat_str <- write_seqArchR_cluster_track_bed(sname = "sample1", clusts = use_clusts, info_df = info_df, use_q_bound = FALSE, use_as_names = "use_names", dir_path = tempdir(), include_in_report = TRUE, one_zip_all = TRUE )
bed_fname <- system.file("extdata", "example_info_df.bed.gz", package = "seqArchRplus", mustWork = TRUE) info_df <- read.delim(file = bed_fname, sep = "\t", header = TRUE) use_clusts <- readRDS(system.file("extdata", "example_clust_info.rds", package = "seqArchRplus", mustWork = TRUE)) # Write seqArchR clusters of promoters/CAGE tag clusters as BED track files # All possible variations are enlisted here # Using quantiles information as the tag cluster boundaries # via arg `use_q_bound = TRUE` and using custom names for each. # Create a new/custom column, and specify the new column name for argument # `use_as_names`. Notice that any custom names can be obtained by this # approach. info_df$use_names <- paste(rownames(info_df), info_df$domTPM, sep = "_") write_seqArchR_cluster_track_bed(sname = "sample1", clusts = use_clusts, info_df = info_df, use_q_bound = FALSE, use_as_names = "use_names", dir_path = tempdir() ) # Generating textual output that can be included in HTML reports. # This requires package xfun. cat_str <- write_seqArchR_cluster_track_bed(sname = "sample1", clusts = use_clusts, info_df = info_df, use_q_bound = FALSE, use_as_names = "use_names", dir_path = tempdir(), include_in_report = TRUE, one_zip_all = TRUE )