Alternative polyadenylation (APA) is one of the most important post-transcriptional regulation mechanisms which is prevalent in Eukaryotes. Like alternative splicing, APA can increase transcriptome diversity. In addition, it defines 3’ UTR and results in altered expression of the gene. It is a tightly controlled process and mis-regulation of APA can affect many biological processed, such as uncontrolled cell cycle and growth. Although several high throughput sequencing methods have been developed, there are still limited data dedicated to identifying APA events.
However, massive RNA-seq datasets, which were originally created to quantify genome-wide gene expression, are available in public databases such as GEO and TCGA. These RNA-seq datasets also contain information of genome-wide APA. Thus, we developed the InPAS package for identifying APA from the conventional RNA-seq data.
The major procedures in InPAS workflow are as follows:
In addition, the InPAS package also provide functions to perform quality control over RNA-seq data coverage, visualize differential usage of proximal and distal CP sites for genes of interest, and prepare essential files for gene set enrichment analysis (GSEA) to reveal biological insights from genes with alternative CP sites.
First, load the required packages, including InPAS, and species-specific genome and genome annotation database: BSgenome, TxDb and EnsDb.
logger <- file(tempfile(), open = "wt")
sink(logger, type="message")
suppressPackageStartupMessages({
library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(BSgenome.Hsapiens.UCSC.hg19)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(EnsDb.Hsapiens.v86)
library(EnsDb.Mmusculus.v79)
library(cleanUpdTSeq)
library(RSQLite)
library(future.apply)
})
Seven tables are created in the database. Table “metadata” stores the metadata, including information for tag (sample name), condition (experimental treatment group), bedgraph_file (paths to BEDGraph files), and depth (whole genome coverage depth) which is initially set to zeros and later updated during analysis. Tables “sample_coverage”, “chromosome_coverage”, “total_coverage”, “utr3_total_coverage”, “CPsites”, and “utr3cds_coverage” store names of intermediate files and the chromosome and tag (sample name) relevant to the files.
plan(sequential)
data_dir <- system.file("extdata", package = "InPAS")
bedgraphs <- c(file.path(data_dir, "Baf3.extract.bedgraph"),
file.path(data_dir, "UM15.extract.bedgraph"))
hugeData <- FALSE
genome <- BSgenome.Mmusculus.UCSC.mm10
tags <- c("Baf3", "UM15")
metadata <- data.frame(tag = tags,
condition = c("Baf3", "UM15"),
bedgraph_file = bedgraphs)
## In reality, don't use a temporary directory for your analysis. Instead, use a
## persistent directory to save your analysis output.
outdir = tempdir()
write.table(metadata, file = file.path(outdir =outdir, "metadata.txt"),
sep = "\t", quote = FALSE, row.names = FALSE)
sqlite_db <- setup_sqlitedb(metadata = file.path(outdir = outdir,
"metadata.txt"),
outdir = outdir)
## check the database
db_conn <- dbConnect(drv = RSQLite::SQLite(), dbname = sqlite_db)
dbListTables(db_conn)
## [1] "CPsites" "chromosome_coverage" "genome_anno"
## [4] "metadata" "sample_coverage" "total_coverage"
## [7] "utr3_total_coverage" "utr3cds_coverage"
## tag condition
## 1 Baf3 Baf3
## 2 UM15 UM15
## bedgraph_file depth
## 1 /tmp/RtmptiLWrV/Rinst211a3d9e7a9/InPAS/extdata/Baf3.extract.bedgraph 0
## 2 /tmp/RtmptiLWrV/Rinst211a3d9e7a9/InPAS/extdata/UM15.extract.bedgraph 0
3’ UTR annotation, including start and end coordinates, and strand
information of 3’ UTRs, last CDS and the gaps between 3’ extremities of
3’ UTRs and immediate downstream exons, is extracted using the function
extract_UTR3Anno
from genome annotation databases: a TxDb
database and an Ensembldb database for a species of interest. For
demonstration, the following snippet of R scripts shows how to extract
3’ UTR annotation from a abridged TxDb for a human reference genome
(hg19). In reality, users should use a TxDb for the most reliable genome
annotation of the PRIMARY reference genome assembly (NOT including the
alternative patches) used for RNA-seq read alignment. If a TxDb is not
available for the species of interest, users can build one using the
function makeTxDbFromUCSC, makeTxDbFromBiomart, makeTxDbFromEnsembl, or
makeTxDbFromGFF from the GenomicFeatures package, depending on the
sources of the genome annotation file.
samplefile <- system.file("extdata",
"hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
TxDb <- loadDb(samplefile)
seqnames <- seqnames(BSgenome.Hsapiens.UCSC.hg19)
# exclude mitochondrial genome and alternative haplotypes
chr2exclude <- c("chrM", "chrMT", seqnames[grepl("_(hap\\d+|fix|alt)$", seqnames, perl = TRUE)])
# set up global variables for InPAS analysis
set_globals(genome = BSgenome.Hsapiens.UCSC.hg19,
TxDb = TxDb,
EnsDb = EnsDb.Hsapiens.v86,
outdir = tempdir(),
chr2exclude = chr2exclude,
lockfile = tempfile())
## Setting default genome to hg19.
## Setting default EnsDb to EnsDb.
## Setting default TxDb to TxDb.
utr3_anno <-
extract_UTR3Anno(sqlite_db = sqlite_db,
TxDb = getInPASTxDb(),
edb = getInPASEnsDb(),
genome = getInPASGenome(),
outdir = getInPASOutputDirectory(),
chr2exclude = getChr2Exclude())
## Warning: Unable to map 3 of 42 requested IDs.
## GRanges object with 6 ranges and 8 metadata columns:
## seqnames ranges
## <Rle> <IRanges>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 chr1 32673684-32674288
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 chr1 153333315-153333503
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 chr1 155719929-155720673
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 chr1 165533062-165533185
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 chr1 207245717-207251162
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 chr1 207252365-207254368
## strand | exon_rank transcript
## <Rle> | <integer> <character>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 + | 5 uc001bum.2
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 + | 3 uc001fbq.3
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 + | 13 uc031pqa.1
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 + | 2 uc001gde.2
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 + | 15 uc001hfg.3
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 + | 15 uc001hfh.3
## feature gene
## <character> <character>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 utr3 55721
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 utr3 6280
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 utr3 100129405
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 utr3 440699
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 utr3 5208
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 utr3 5208
## exon symbol
## <character> <character>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 chr1:lastutr3:uc001b.. IQCC
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 chr1:lastutr3:uc001f.. S100A9
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 chr1:lastutr3:uc031p.. 100129405
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 chr1:lastutr3:uc001g.. LRRC52
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 chr1:lastutr3:uc001h.. PFKFB2
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 chr1:lastutr3:uc001h.. PFKFB2
## annotatedProximalCP truncated
## <character> <logical>
## chr1:lastutr3:uc001bum.2_5|IQCC|utr3 unknown FALSE
## chr1:lastutr3:uc001fbq.3_3|S100A9|utr3 unknown FALSE
## chr1:lastutr3:uc031pqa.1_13|100129405|utr3 proximalCP_155720479 FALSE
## chr1:lastutr3:uc001gde.2_2|LRRC52|utr3 unknown FALSE
## chr1:lastutr3:uc001hfg.3_15|PFKFB2|utr3 unknown FALSE
## chr1:lastutr3:uc001hfh.3_15|PFKFB2|utr3 unknown FALSE
## -------
## seqinfo: 83 sequences from an unspecified genome
This vignette will use the prepared 3’ UTR annotation for the mouse reference genome mm10 for subsequent demonstration
## set global variables for mouse InPAS analysis
set_globals(genome = BSgenome.Mmusculus.UCSC.mm10,
TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
EnsDb = EnsDb.Mmusculus.v79,
outdir = tempdir(),
chr2exclude = "chrM",
lockfile = tempfile())
## Setting default genome to mm10.
## Setting default EnsDb to EnsDb.
## Setting default TxDb to TxDb.
tx <- parse_TxDb(sqlite_db = sqlite_db,
TxDb = getInPASTxDb(),
edb = getInPASEnsDb(),
genome = getInPASGenome(),
outdir = getInPASOutputDirectory(),
chr2exclude = getChr2Exclude())
## Warning: Unable to map 6032 of 24594 requested IDs.
Before this step, genome coverage in the BEDGraph format should be
prepared from BAM files resulted from RNA-seq data alignment using the
genomecov
command in the BEDTools suite. BAM files can be
filtered to remove multi-mapping alignments, alignments with low mapping
quality and so on. Commands for reference are as follows:
## for single end RNA-seq data aligned with STAR
## -q 255, unique mapping
samtools view -bu -h -q 255 /path/to/XXX.SE.bam | \
bedtools genomecov -ibam - -bga -split > XXX.SE.uniq.bedgraph
## for paired-end RNA-seq data aligned with STAR
samtools view -bu -h -q 255 /path/to/XXX.PE.bam | \
bedtools genomecov -ibam - -bga -split > XXX.PE.uniq.bedgraph
The genome coverage data in the BEDGraph formatis converted into R
objects of Rle-class using the get_ssRleCov
function for
each chromosome of each sample. Rle objects for each individual
chromosome are save to outdir
. The filename, tag (sample
name), and chromosome name are save to Table “sample_coverage”.
Subsequently, chromosome-specific Rle objects for all samples are
assemble together into a two-level list of Rle objects, with level 1
being the chromosome name and level 2 being Rle for each tag (sample
name). Notably, the sample BEDGraph files used here only contain
coverage data for “chr6” of the mouse reference genome mm10.
coverage <- list()
for (i in seq_along(bedgraphs)){
coverage[[tags[i]]] <- get_ssRleCov(bedgraph = bedgraphs[i],
tag = tags[i],
genome = genome,
sqlite_db = sqlite_db,
outdir = outdir,
chr2exclude = getChr2Exclude())
}
coverage <- assemble_allCov(sqlite_db,
seqname = "chr6",
outdir,
genome = getInPASGenome())
At this point, users can check the data quality in terms of coverage
for all and expressed genes and 3’ UTRs using
run_coverageQC
. This function output summarized coverage
metrics: gene.coverage.rate, expressed.gene.coverage.rate,
UTR3.coverage.rate, and UTR3.expressed.gene.subset.coverage.rate. The
coverage rate of quality data should be greater than 0.75 for 3’ UTRs of
expressed genes.
if (.Platform$OS.type == "windows")
{
plan(multisession)
} else {
plan(multicore)
}
run_coverageQC(sqlite_db,
TxDb = getInPASTxDb(),
edb = getInPASEnsDb(),
genome = getInPASGenome(),
chr2exclude = getChr2Exclude(),
which = GRanges("chr6",
ranges = IRanges(98013000, 140678000)))
## strand information will be ignored.
## gene.coverage.rate expressed.gene.coverage.rate UTR3.coverage.rate
## Baf3 0.003463505 0.5778441 0.01419771
## UM15 0.003428528 0.5719748 0.01405159
## UTR3.expressed.gene.subset.coverage.rate
## Baf3 0.8035821
## UM15 0.7953112
depth weight, Z-score cutoff thresholds, and total coverage along 3’
UTRs merged across biological replicates within each condition (huge
data) or individual sample (non-huge data) are returned by the
setup_CPsSearch
function. Potential novel CP sites are
identified for each chromosome using the search_CPs
function. These potential CP sites can be filtered and/or adjusted using
the Naive Bayes classifier provided by cleanUpdTseq and/or by
using the polyadenylation scores by simply matching the position-weight
matrix (PWM) for the hexamer polyadenylation signal (AAUAAA and the
like).
## load the Naive Bayes classifier model for classify CP sites from the
## cleanUpdTseq package
data(classifier)
prepared_data <- setup_CPsSearch(sqlite_db,
genome = getInPASGenome(),
chr.utr3 = utr3.mm10$chr6,
seqname = "chr6",
background = "10K",
TxDb = getInPASTxDb(),
hugeData = TRUE,
outdir = outdir,
silence = TRUE,
coverage_threshold = 5)
CPsites <- search_CPs(seqname = "chr6",
sqlite_db = sqlite_db,
genome = getInPASGenome(),
MINSIZE = 10,
window_size = 100,
search_point_START = 50,
search_point_END = NA,
cutEnd = 0,
filter.last = TRUE,
adjust_distal_polyA_end = TRUE,
long_coverage_threshold = 2,
PolyA_PWM = NA,
classifier = classifier,
classifier_cutoff = 0.8,
shift_range = 50,
step = 5,
outdir = outdir,
silence = TRUE)
## No readable configuration file found
## Created registry in '/tmp/Rtmp1jlDUR/006.CPsites.out_chr6' using cluster functions 'Interactive'
## Adding 1 jobs ...
## Submitting 1 jobs in 1 chunks using cluster functions 'Interactive' ...
Estimate usage of proximal and distal CP sites based on read coverage along the short and long 3’ UTRs
InPAS provides the function test_dPDUI
to identify
differential usage of proximal and distal CP sites between different
conditions leveraging different statistical models according to the
experimental design. InPAS offers statistical methods for single sample
differential PDUI analysis, and single group analysis. Additionally,
InPAS provides Fisher exact test for two-group unreplicated design, and
empirical Bayes linear model leveraging the limma package for more
complex design. The test results can be further filtered using the
filter_testOut
function based on the fraction samples
within each condition with coverage data for the identified differential
PDUI events, and/or cutoffs of nominal p-values, adjusted p-values or
log2 (fold change).
InPAS package also provide functions, get_usage4plot
,
plot_utr3Usage
, and setup_GSEA
, to visualize
differential usage of proximal and distal CP sites for genes of
interest, and prepare essential files for gene set enrichment analysis
(GSEA) to reveal biological insights from genes with alternative CP
sites.
## Visualize dPDUI events
gr <- GRanges("chr6", IRanges(128846245, 128850081), strand = "-")
names(gr) <- "128846245-128850081"
data4plot <- get_usage4plot(gr,
proximalSites = 128849130,
sqlite_db,
hugeData = TRUE)
plot_utr3Usage(usage_data = data4plot,
vline_color = "purple",
vline_type = "dashed")
R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC tzcode source: system (glibc)
attached base packages: [1] stats4 stats graphics grDevices utils
datasets methods
[8] base
other attached packages: [1] future.apply_1.11.3
[2] future_1.34.0
[3] RSQLite_2.3.7
[4] cleanUpdTSeq_1.45.0
[5] BSgenome.Drerio.UCSC.danRer7_1.4.0
[6] EnsDb.Mmusculus.v79_2.99.0
[7] EnsDb.Hsapiens.v86_2.99.0
[8] ensembldb_2.29.1
[9] AnnotationFilter_1.31.0
[10] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 [11]
GenomicFeatures_1.57.1
[12] AnnotationDbi_1.69.0
[13] Biobase_2.67.0
[14] BSgenome.Hsapiens.UCSC.hg19_1.4.3
[15] BSgenome.Mmusculus.UCSC.mm10_1.4.3
[16] BSgenome_1.75.0
[17] rtracklayer_1.65.0
[18] BiocIO_1.17.0
[19] Biostrings_2.75.0
[20] XVector_0.45.0
[21] GenomicRanges_1.57.2
[22] GenomeInfoDb_1.41.2
[23] IRanges_2.39.2
[24] S4Vectors_0.43.2
[25] BiocGenerics_0.53.0
[26] InPAS_2.15.0
[27] BiocStyle_2.35.0
loaded via a namespace (and not attached): [1] sys_3.4.3
jsonlite_1.8.9
[3] magrittr_2.0.3 farver_2.1.2
[5] rmarkdown_2.28 fs_1.6.4
[7] zlibbioc_1.51.2 vctrs_0.6.5
[9] memoise_2.0.1 Rsamtools_2.21.2
[11] RCurl_1.98-1.16 htmltools_0.5.8.1
[13] S4Arrays_1.5.11 progress_1.2.3
[15] curl_5.2.3 truncnorm_1.0-9
[17] SparseArray_1.5.45 sass_0.4.9
[19] parallelly_1.38.0 bslib_0.8.0
[21] plyr_1.8.9 cachem_1.1.0
[23] buildtools_1.0.0 GenomicAlignments_1.41.0
[25] lifecycle_1.0.4 pkgconfig_2.0.3
[27] Matrix_1.7-1 R6_2.5.1
[29] fastmap_1.2.0 GenomeInfoDbData_1.2.13
[31] MatrixGenerics_1.17.1 digest_0.6.37
[33] colorspace_2.1-1 base64url_1.4
[35] labeling_0.4.3 fansi_1.0.6
[37] httr_1.4.7 abind_1.4-8
[39] compiler_4.4.1 proxy_0.4-27
[41] bit64_4.5.2 withr_3.0.2
[43] brew_1.0-10 backports_1.5.0
[45] BiocParallel_1.41.0 DBI_1.2.3
[47] highr_0.11 MASS_7.3-61
[49] depmixS4_1.5-0 rappdirs_0.3.3
[51] DelayedArray_0.31.14 rjson_0.2.23
[53] tools_4.4.1 nnet_7.3-19
[55] glue_1.8.0 restfulr_0.0.15
[57] nlme_3.1-166 grid_4.4.1
[59] checkmate_2.3.2 reshape2_1.4.4
[61] ade4_1.7-22 generics_0.1.3
[63] seqinr_4.2-36 gtable_0.3.6
[65] tzdb_0.4.0 flock_0.7
[67] class_7.3-22 preprocessCore_1.67.1
[69] data.table_1.16.2 hms_1.1.3
[71] utf8_1.2.4 pillar_1.9.0
[73] stringr_1.5.1 vroom_1.6.5
[75] limma_3.61.12 batchtools_0.9.17
[77] dplyr_1.1.4 lattice_0.22-6
[79] bit_4.5.0 tidyselect_1.2.1
[81] maketools_1.3.1 knitr_1.48
[83] ProtGenerics_1.37.1 SummarizedExperiment_1.35.5 [85] xfun_0.48
statmod_1.5.0
[87] matrixStats_1.4.1 Rsolnp_1.16
[89] stringi_1.8.4 UCSC.utils_1.1.0
[91] lazyeval_0.2.2 yaml_2.3.10
[93] evaluate_1.0.1 codetools_0.2-20
[95] tibble_3.2.1 BiocManager_1.30.25
[97] cli_3.6.3 munsell_0.5.1
[99] jquerylib_0.1.4 Rcpp_1.0.13
[101] globals_0.16.3 png_0.1-8
[103] XML_3.99-0.17 parallel_4.4.1
[105] ggplot2_3.5.1 readr_2.1.5
[107] blob_1.2.4 prettyunits_1.2.0
[109] plyranges_1.25.0 bitops_1.0-9
[111] listenv_0.9.1 scales_1.3.0
[113] e1071_1.7-16 crayon_1.5.3
[115] rlang_1.1.4 KEGGREST_1.45.1