---
title: "Estimating Transcription Rates with STADyUM from PRO-seq Data"
author: "Xin Zeng"
date: "2026/04/01"
output:
  html_document:
    toc: true
    toc_depth: 3
    theme: default
    highlight: tango
    fig_retina: 1
vignette: >
  %\VignetteIndexEntry{Estimating Transcription Rates with STADyUM from PRO-seq Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

# Introduction

This tutorial demonstrates how to use the **STADyUM** package to estimate transcription rates from PRO-seq data. STADyUM (Statistical Analysis of Dynamics Under Multiple Measurements) models RNA polymerase dynamics by analyzing reads at promoter-proximal pause sites and gene bodies. Refer to @Zhao2023 and @Siepel2022 for more details on modeling the transcription rates from real PRO-seq data.

In this workflow, we will:

1. Identify transcription start sites (TSSs) for each gene by keeping the TSS with the maximum score in the annotations
2. Construct read-counting regions for pause sites and gene bodies
3. Apply STADyUM to estimate key kinetic parameters e(release rate, elongation rate, activity)
4. Visualize the relationships between estimated transcription parameters

This example uses human PRO-seq data from the human CD4 cell type to demonstrate transcription rate estimation

***

# Setup

## Load Required Libraries

```{r setup, message=FALSE, warning=FALSE}
has_connectivity <- requireNamespace("BiocFileCache", quietly = TRUE) &&
    curl::has_internet()
knitr::opts_chunk$set(dpi = 72, fig.width = 5, fig.height = 4,
                      eval = has_connectivity)
if (!has_connectivity)
    message("No internet connection; vignette code not evaluated.")

library(STADyUM)           # For transcription rate estimation
library(GenomicRanges)     # For genomic data manipulation
library(dplyr)
library(plyranges)
#library(tidyverse)         # For data manipulation and visualization
library(ggplot2)           # For plotting
library(pracma)
library(BiocFileCache)
```

```{r file-paths}

bfc <- BiocFileCache(ask = FALSE)
zip_path <- bfcrpath(bfc, "https://zenodo.org/records/20618059/files/rate_estimation_vignette_data.zip")
unzip(zip_path, exdir = tempdir())
data_dir <- file.path(tempdir(), "vignettes", "lrt_test_data")

# TSN (Transcription Start Site Network) files
# These are GRanges objects containing TSS annotations for each sample
cd4_tsn_file <- file.path(data_dir, 'PROseq-HUMAN-CD4_tsn.RDS')

# Transcript annotation file
# Contains gene models with exon structure for all genes
human_tq_file <- file.path(data_dir, 'HUMAN.RDS')

# PRO-seq signal files (stranded bigWig format)
cd4_bw_plus <- file.path(data_dir, 'PROseq-HUMAN-CD4_plus.bw')
cd4_bw_minus <- file.path(data_dir, 'PROseq-HUMAN-CD4_minus.bw')

```

***

# Data Loading and TSS Selection

## Load TSS Annotations

We begin by loading transcription start site (TSS) annotations from the CD4 cell type. These annotations may contain multiple TSSs per gene (alternative promoters). For this analysis, we select a single representative TSS per gene.

```{r load-tss}
# Load TSN data for both cell types
cd4_tsn <- readRDS(cd4_tsn_file)

# Display structure of TSN data
head(cd4_tsn)
```

## Select Single TSS per Gene

When multiple TSSs exist for a single gene, we retain only the TSS with the maximum score. If multiple TSSs have the same score, select the most upstream TSS to break the tie. This ensures consistent TSS usage across the analysis. The "most upstream" is defined as the TSS with the smallest genomic coordinate on the plus strand and the largest coordinate on the minus strand.

```{r select-single-tss}
# Process CD4 data: select most upstream TSS per gene
cd4_tsn <- STADyUM:::keep_max_tsn(cd4_tsn)

cat("CD4 TSS sites:", length(cd4_tsn), "\n")
```

***

# Building Read-Count Regions

## Load Transcript Annotations

We load the full transcript annotation to define gene bodies. From these, we extract the transcription termination sites (TTS) to define the 3' boundary of gene bodies. 

```{r load-transcripts}
# Load transcript model for all genes
human_tq <- readRDS(human_tq_file)

# Extract gene ranges and reduce overlapping exons
# This creates a single range per gene (union of all exons)
gngrng <- human_tq@transcripts %>%
  group_by(ensembl_gene_id) %>%
  plyranges::reduce_ranges_directed() %>%
  sort()

cat("Number of genes:", length(gngrng), "\n")
```

## Define Pause and Gene Body Regions

For STADyUM analysis, we define two counting regions:
- **Pause region**: typically near the TSS (promoter-proximal pause site)
- **Gene body region**: the elongating polymerase region 

Here, we anchor at the transcription termination site (3' end) and define regions relative to those anchors.
The default parameters for building the read count regions using `STADyUM:::build_readcount_regions(bw_tsn, transcripts, tsn_cutoff=5, gb_min_length=6000, gb_trim_len=1250, gb_max_length=90000, kmax=200)` is as shown. 
We filter out low-signal TSNs with low scores before creating pause windows. The gene body is defined as the region from 1,250bp downstream of the TSS to 90 kbp downstream of the TSS. For genes shorter than 90kb, the gene body extends from 1,250bp downstream of the TSS to the TES.

```{r define-count-regions}
# Anchor at 3' end (transcription termination site) and set width to 1
# This creates a point coordinate at the TTS for each gene
bw_tts <- gngrng %>% 
  plyranges::anchor_3p() %>% 
  mutate(width = 1)

# Build read-count regions for CD4
# Returns list with 'pause' and 'gene_body' GRanges
cd4_count_region <- STADyUM:::build_readcount_regions(cd4_tsn, bw_tts)

cat("CD4 pause regions:", length(cd4_count_region$pause), "\n")
cat("CD4 gene body regions:", length(cd4_count_region$gene_body), "\n")
```

***

# Running STADyUM

## Estimate Transcription Rates

The `estimateTranscriptionRates()` function applies STADyUM to count reads in pause and gene body regions. It estimates key kinetic parameters:

- **beta (β)**: pause release rate (1/time)
- **chi (χ)**: elongation rate relative to transcription activity
- **activity**: transcription initiation/activity level

```{r run-stadyum}
# Estimate transcription rates for CD4
# Input: plus and minus strand bigWig files, pause/gene body regions, sample label
cd4_rate <- estimateTranscriptionRates(
  cd4_bw_plus, cd4_bw_minus,
  cd4_count_region$pause, cd4_count_region$gene_body, 
  "CD4"
)

cat("CD4 rate estimation complete\n")
```

***

# Visualization

## CD4 Visualizations

### Pause Release Rate vs. Transcription Activity

This plot shows the relationship between the pause release rate (beta) and transcription activity (chi). Each point represents one gene. Positively correlated genes may indicate coupled regulation of pause release and transcription initiation.

```{r cd4-beta-chi, fig.width=3.5, fig.height=3}

cat("CD4 genes with estimated rates:", nrow(rates(cd4_rate)), "\n")

p1 <- plotScatterDensity(cd4_rate, "betaAdp", "chi",
                          log_x = TRUE, log_y = TRUE,
                          xlab = expression(log[10] ~ beta),
                          ylab = expression(log[10] ~ chi),
                          xlim = c(-6, -1), ylim = c(-4, 1))

if (requireNamespace("ggpubr", quietly = TRUE)) {
  p1 <- p1 + scale_color_viridis_c() +
    ggpubr::stat_cor(method = "spearman", label.x = -3.5, label.y = -3.8,
                     size = 5, aes(label = after_stat(r.label)))
}

print(p1)
```

### Pause Release Rate vs. Elongation Parameters

These plots show relationships between pause release rate (beta) and two measures of elongation variability. The first shows beta vs. mean elongation time; the second shows beta vs. elongation time standard deviation.

```{r cd4-beta-elongation, fig.width=8, fig.height=3}
p1 <- plotScatterDensity(cd4_rate, "betaAdp", "fkMean",
                          log_x = TRUE,
                          xlab = expression(log[10] ~ beta),
                          ylab = expression(mu))
if (requireNamespace("ggpubr", quietly = TRUE)) {
  p1 <- p1 + scale_color_viridis_c() +
    ggpubr::stat_cor(method = "spearman", label.x = -2.2, label.y = 160,
                     size = 5, aes(label = after_stat(r.label)))
}

p2 <- plotScatterDensity(cd4_rate, "betaAdp", "fkSD",
                          log_x = TRUE,
                          xlab = expression(log[10] ~ beta),
                          ylab = expression(sigma),
                          color_var = "sdGroup",
                          color_values = c(Broad = "#1b9e77", Sharp = "#762a83"),
                          color_lab = expression(sigma ~ group))
if (requireNamespace("ggpubr", quietly = TRUE)) {
  p2 <- p2 +
    ggpubr::stat_cor(method = "spearman", label.x = -2.5, label.y = 75,
                     size = 3, aes(label = after_stat(r.label), group = 1),
                     color = "black")
}
p2_hist <- ggplot(rates(cd4_rate), aes(x = .data$fkSD)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  geom_vline(xintercept = 38.5, linetype = "dashed", color = "black",
             linewidth = 0.6) +
  coord_flip() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
print(p1)

grid::grid.newpage()
grid::pushViewport(grid::viewport(layout = grid::grid.layout(1, 2, widths = grid::unit(c(4, 1), "null"))))
print(p2,      vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 1))
print(p2_hist, vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 2))

p3 <- plotScatterDensity(cd4_rate, "fkMean", "fkSD",
                          xlab = expression(mu),
                          ylab = expression(sigma))
if (requireNamespace("ggpubr", quietly = TRUE)) {
  p3 <- p3 + scale_color_viridis_c() +
    ggpubr::stat_cor(method = "spearman", label.x = 112, label.y = 1,
                     size = 5, aes(label = after_stat(r.label)))
}

print(p3)

```

***

# Summary

This tutorial demonstrated a complete workflow for estimating transcription rates from PRO-seq data using STADyUM:

1. **TSS Identification**: We identified a single transcription start site using annotations.

2. **Region Definition**: We defined pause (promoter-proximal) and gene body regions as input for rate estimation.

3. **Rate Estimation**: STADyUM estimated three key transcription parameters (pause release rate, elongation characteristics, and activity) from the read distribution in these regions.

4. **Visualization**: We explored relationships between parameters, revealing how pause release and elongation dynamics are coupled in CD4 cells.

The estimated rates are now ready for downstream comparative analysis (e.g., identifying genes with cell-type-specific transcription dynamics) and integration with other genomic data.

***

# Session Information

```{r session-info}
sessionInfo()
```