Identifying human miRNA transcriptional start sites (TSSs) plays a significant role in understanding the transcriptional regulation of miRNA. However, due to the quick capping of pri-miRNA and many miRNA genes may lie in the introns or even exons of other genes, it is difficult to detect miRNA TSSs. miRNA TSSs are cell-specific. And miRNA TSSs are cell-specific, which implies the same miRNA in different cell-lines may start transcribing at different TSSs.
High throughput sequencing, like ChIP-seq, has gradually become an essential and versatile approach for us to identify and understand genomes and their transcriptional processes. By integrating H3k4me4 and Pol II data, parting of false positive counts after scoring can be filtered out. Besides, DNase I hypersensitive sites(DHS) also imply TSSs, where miRNAs will be accessible and functionally related to transcription activities. And additionally, the expression profile of miRNA and genes in certain cell-line will be considered as well to improve fidelity. By employing all these different kinds of data, here we have developed the primirTSS package to assist users to identify miRNA TSSs in human and to provide them with related information about where miRNA genes lie in the genome, with both command-line and graphical interfaces.
Install the latest release of R, then get primirTSS
by
starting R and entering the commands:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("primirTSS")
Or install the development version of the package using the same command, but with GitHub account name.
As Java development environment is indispensable for the primary
function in our package, it is necessary for users to
install Java
SE Development Kit 10 before using primirTSS
.
peak_merge()
: Merge one kind of peaks
(H3K4me3 or Pol II)H3K4me3 and Pol II data are key
points for accurate prediction with our method. If one of these two peak
data is input, before executing the main function find_TSS
,
the function peak_merge
should be used to merge adjacent
peaks whose distance between each other is less than n
base
pairs and return the merged peaks as an output.
library(primirTSS)
peak_df <- data.frame(chrom = c("chr1", "chr2", "chr1"),
chromStart = c(450, 460, 680),
chromEnd = c(470, 480, 710),
stringsAsFactors = FALSE)
peak <- as(peak_df, "GRanges")
peak_merge(peak, n =250)
#> GRanges object with 2 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 450-710 *
#> [2] chr2 460-480 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
peak_join()
: Join two kinds of peaks
(H3K4me3 and Pol II)If both of H3K4me3 and Pol II data, after separately merging these
two kinds of evidence first, peak_join
should be employed
to integrate H3K4me3 and Pol II peaks and return the result as
bed_merged
parameter for the main function
find_tss
.
peak_df1 <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr2"),
start = c(100, 460, 600, 70),
end = c(200, 500, 630, 100),
stringsAsFactors = FALSE)
peak1 <- as(peak_df1, "GRanges")
peak_df2 <- data.frame(chrom = c("chr1", "chr1", "chr1", "chr2"),
start = c(160, 470, 640, 71),
end = c(210, 480, 700, 90),
stringsAsFactors = FALSE)
peak2 <- as(peak_df2, "GRanges")
peak_join(peak1, peak2)
#> GRanges object with 3 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 160-200 *
#> [2] chr1 470-480 *
#> [3] chr2 71-90 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
find_tss
is the primary function in the package. The
program will first score the candidate TSSs of miRNA and pick up the
best candidate in the first step of prediction, (where users can set
flanking_num
and threshold
).
There will be different circumstances where not all miRNA expression profiles, DHS data, protein-coding gene(‘gene’) expression profiles are available:
Circumstance 1: no miRNA expression data; then suggest DHS check and protein-coding gene check.
ignore_DHS_check
: If users do not have their own miRNA
expression profile, the function will employ all the miRNAs already
annotated in human, but we suggest using DHS data of the cell line from
ENCODE to check whether this miRNA is expressed in the
cell line or not as well as and all human gene expression profiles from
Ensemble to check the relative position of TSSs and
protein-coding genes to improve the accuracy of prediction.peakfile <- system.file("testdata", "HMEC_h3.csv", package = "primirTSS")
DHSfile <- system.file("testdata", "HMEC_DHS.csv", package = "primirTSS")
peak_h3 <- read.csv(peakfile, stringsAsFactors = FALSE)
DHS <- read.csv(DHSfile, stringsAsFactors = FALSE)
DHS <- as(DHS, "GRanges")
peak_h3 <- as(peak_h3, "GRanges")
peak <- peak_merge(peak_h3)
no_ownmiRNA <- find_tss(peak, ignore_DHS_check = FALSE,
DHS = DHS, allmirdhs_byforce = FALSE,
expressed_gene = "all",
allmirgene_byforce = FALSE,
seek_tf = FALSE)
Circumstance 2: miRNA expression data provided; then no need for DHS check but protein-coding gene check.
expressed_mir
: If users have their own miRNA expression
profiles, we will use the expressed miRNAs and we suggest not using DHS
data of the cell line or others to check the expression of miRNAs.But
the protein-coding gene check to check the relative position of TSSs and
protein-coding genes is necessary, which helps to verify the precision
of prediction.bed_merged <- data.frame(
chrom = c("chr1", "chr1", "chr1", "chr1", "chr2"),
start = c(9910686, 9942202, 9996940, 10032962, 9830615),
end = c(9911113, 9944469, 9998065, 10035458, 9917994),
stringsAsFactors = FALSE)
bed_merged <- as(bed_merged, "GRanges")
expressed_mir <- c("hsa-mir-5697")
ownmiRNA <- find_tss(bed_merged, expressed_mir = expressed_mir,
ignore_DHS_check = TRUE,
expressed_gene = "all",
allmirgene_byforce = TRUE,
seek_tf = FALSE)
expressed_gene
: Additionally, users can also specify
certain genes expressed in the cell-line being analyzed:seek_tf = TRUE
: If user want to predict transcriptional
regulation relationship between TF and miRNA, like which TFs might
regulate miRNA after get TSSs, they can change
seek_tf = FALSE
from seek_tf = TRUE
directly
in the comprehensive function find_TSS()
.Here is a demo of predicting TSS for hsa-mir-5697, ignore DHS check.
PART1, $tss_df
:
ownmiRNA$tss_df
#> # A tibble: 1 × 10
#> mir_name chrom stem_loop_p1 stem_loop_p2 strand mir_context tss_type gene
#> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 hsa-mir-5697 chr1 9967381 9967458 + intra host_TSS ENSG…
#> # ℹ 2 more variables: predicted_tss <dbl>, pri_tss_distance <dbl>
The first part of the result returns details of predicted TSSs, composed of seven columns: mir_name, chrom, stem_loop_p1, stem_loop_p2, strand mir_context, tss_type gene and predicted_tss:
Entry | Implication |
---|---|
mir_name | Name of miRNA. |
chrom | Chromosome. |
stem_loop_p1 | The start site of a stem-loop. |
stem_loop_p2 | The end site of a stem-loop. |
strand | Polynucleotide strands. (+/- ) |
mir_context | 2 types of relative position relationship between stem-loop and
protein-coding gene. (intra/inter ) |
tss_type | 4 types of predicted TSSs. See the section below TSS types for
details.(host_TSS/intra_TSS/overlap_inter_TSS/inter_TSS ) |
gene | Ensembl gene ID. |
predicted_tss | Predicted transcription start sites(TSSs). |
pri_tss_distance | The distance between a predicted TSS and the start site of the stem-loop. |
TSSs are cataloged into 4 types as below:
host_TSS: The TSSs of miRNA that are close to
the TSS of protein-coding gene implying they may share the same TSS, on
the condition where mir_context = intra
. (See above:
mir_context
)
intra_TSS: The TSSs of miRNA that are NOT close
to the TSS of protein-coding gene, on the condition where
mir_context = intra
.
overlap_inter_TSS: The TSSs of miRNA are
cataloged as overlap_inter_TSS
when the pri-miRNA gene
overlaps with Ensembl gene, on the condition where
“mir_context = inter
”.
inter_inter_TSS: The TSSs of miRNA are cataloged
as inter_inter_TSS
when the miRNA gene does NOT overlap
with Ensembl gene, on the condition where
“mir_context = inter
”.
(See Xu HUA et al 2016 for more details)
PART2, $log
:
The second part of the result returns 4 logs created during the process of prediction:
find_nearest_peak_log
: If no peaks
locate in the upstream of a stem-loop to help determine putative TSSs of
miRNA, we will fail to find the nearest peak, and this miRNA will be
logged in find_nearest_peak_log
.
eponine_score_log
: For a certain
miRNA, if none of the candidate TSSs scored with Eponine method meet the
threshold we set, we will fail to get an eponine score, and this miRNA
will be logged in eponine_score_log
.
DHS_check_log
: For a certain miRNA,
if no DHS signals locate within 1 kb upstream of each putative TSSs,
these putative TSSs will be filtered out, and this miRNA will be logged
in DHS_check_log
.
gene_filter_log
: For a certain
miRNA, when integrating expressed_gene data to improve prediction, if no
putative TSSs are confirmed after considering the relative position
relationship among TSSs, stem-loops and expressed genes, this miRNA will
be filtered out and logged in gene_filter_log
.
plot_primiRNA()
: Apart from returning the putative TSS
of each miRNA, the package primirTSS
can also visualize the
result and return an image composed of six tracks, (1)TSS, (2)genome,
(3)pri-miRNA, (4)the closest gene, (5)eponine score and (6)conservation
score. And the parameters in this function is almost the same as those
in find_tss()
except expressed_mir
only
represents one certain miRNA in plot_primiRNA()
.
NOTICE that this function is used for visualizing the
TSS prediction of only one specific miRNA every single
time.plot_primiRNA(expressed_mir, bed_merged,
flanking_num = 1000, threshold = 0.7,
ignore_DHS_check = TRUE,
DHS, allmirdhs_byforce = TRUE,
expressed_gene = "all",
allmirgene_byforce = TRUE)
Figure S1. Visualized result for miRNA TSSs by
Plot pri-miRNA TSS()
As Figure S1 shows, the picture contains information of the pri-miRNA’s coordinate, the closest gene to the miRNA, the eponine score of the miRNA’s candidate TSS and the conservation score of the miRNA’s candidate TSS. There are six tracks plotted in return:
Entry Implication Chromosome Position of miRNA on the chromosome. hg19 Reference genome coordinate in hg19. pri-miRNA: Position of pri-miRNA. Ensemble genes Position of related protein-coding gene. Eponine score Score of best putative TSS by Eponine method. Conservation score Conservation score of TSS.
run_primirTSSapp()
: A graphical web interface is
designed to achieve the functions of find_tss
and
plot_primiRNA
to help users intuitively and conveniently
predict putative TSSs of miRNA. Users can refer documents of the two
functions, Find the best putative TSS and Plot
the prediction of TSSs for miRNA, mentioned above for details.
Figure S2. Graphical web interface of
Find pri-miRNA TSS()
As Figure S2 shows, if we want to use the shiny app, we should select the appropriate options or upload the appropriate files. Histone peaks, Pol II peaks and DHS files are comma-separated values (CSV) files, whose first line is chrom,start,end. Every line of miRNA expression profiles has only one miRNA name which start with hsa-mir, such as hsa-mir-5697. Every line of gene expression profiles has only one gene name which derived from Ensembl, such as ENSG00000261657. All of miRNA expression profiles and gene expression profiles do not have column names. If we have prepared, we can push the Start the analysis button to start finding the TSSs. The process of analysis may need to take a few minutes, and a process bar will appear in right corner.
As a result, we will view first six rows of the result. The first five columns are about miRNA information, next five columns are about TSS information. The column of gene denotes the gene whose TSS is closest to the miRNA TSS. The column of pri_tss_distance denotes the distance between miRNA TSS and stem-loop. If users choose to get TFs simultaneously, they will have an additional column,
tf
, which stores related TFs.
Figure S3. Graphical web interface of
Plot pri-miRNA TSS()
As Figure S4 shows, if we select the appropriate options and upload the appropriate files, we can have a picture of miRNA TSSs.
Here is the output of sessionInfo() on the system on which this document was compiled:
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] primirTSS_1.25.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 dplyr_1.1.4
#> [3] blob_1.2.4 filelock_1.0.3
#> [5] R.utils_2.12.3 Biostrings_2.75.3
#> [7] bitops_1.0-9 fastmap_1.2.0
#> [9] RCurl_1.98-1.16 BiocFileCache_2.15.0
#> [11] GenomicAlignments_1.43.0 XML_3.99-0.17
#> [13] digest_0.6.37 lifecycle_1.0.4
#> [15] KEGGREST_1.47.0 RSQLite_2.3.9
#> [17] magrittr_2.0.3 compiler_4.4.2
#> [19] rlang_1.1.4 sass_0.4.9
#> [21] tools_4.4.2 utf8_1.2.4
#> [23] yaml_2.3.10 BSgenome.Hsapiens.UCSC.hg38_1.4.5
#> [25] rtracklayer_1.67.0 knitr_1.49
#> [27] S4Arrays_1.7.1 phastCons100way.UCSC.hg38_3.7.1
#> [29] bit_4.5.0.1 curl_6.0.1
#> [31] DelayedArray_0.33.3 abind_1.4-8
#> [33] BiocParallel_1.41.0 HDF5Array_1.35.2
#> [35] withr_3.0.2 purrr_1.0.2
#> [37] R.oo_1.27.0 BiocGenerics_0.53.3
#> [39] sys_3.4.3 grid_4.4.2
#> [41] stats4_4.4.2 Rhdf5lib_1.29.0
#> [43] SummarizedExperiment_1.37.0 cli_3.6.3
#> [45] crayon_1.5.3 GenomicScores_2.19.0
#> [47] generics_0.1.3 httr_1.4.7
#> [49] rjson_0.2.23 DBI_1.2.3
#> [51] cachem_1.1.0 rhdf5_2.51.1
#> [53] stringr_1.5.1 zlibbioc_1.52.0
#> [55] parallel_4.4.2 AnnotationDbi_1.69.0
#> [57] BiocManager_1.30.25 XVector_0.47.1
#> [59] restfulr_0.0.15 matrixStats_1.4.1
#> [61] vctrs_0.6.5 Matrix_1.7-1
#> [63] jsonlite_1.8.9 IRanges_2.41.2
#> [65] S4Vectors_0.45.2 bit64_4.5.2
#> [67] maketools_1.3.1 tidyr_1.3.1
#> [69] jquerylib_0.1.4 glue_1.8.0
#> [71] codetools_0.2-20 JASPAR2018_1.1.1
#> [73] stringi_1.8.4 BiocVersion_3.21.1
#> [75] GenomeInfoDb_1.43.2 BiocIO_1.17.1
#> [77] GenomicRanges_1.59.1 UCSC.utils_1.3.0
#> [79] tibble_3.2.1 pillar_1.10.0
#> [81] rappdirs_0.3.3 htmltools_0.5.8.1
#> [83] rhdf5filters_1.19.0 GenomeInfoDbData_1.2.13
#> [85] BSgenome_1.75.0 R6_2.5.1
#> [87] dbplyr_2.5.0 evaluate_1.0.1
#> [89] lattice_0.22-6 Biobase_2.67.0
#> [91] AnnotationHub_3.15.0 png_0.1-8
#> [93] R.methodsS3_1.8.2 Rsamtools_2.23.1
#> [95] memoise_2.0.1 bslib_0.8.0
#> [97] SparseArray_1.7.2 xfun_0.49
#> [99] MatrixGenerics_1.19.0 buildtools_1.0.0
#> [101] pkgconfig_2.0.3