---
title: "SpliceImpactR"
author:
- name: Zachary Wakefield
  affiliation:
  - &1 Boston University, Boston, MA
  - &2 Bioinformatics Program, Boston University, Boston, MA
- name: Ana Fiszbein
  affiliation:
  - *1
  - *2
date: '`r format(Sys.Date(), "%B %e, %Y")`'
package: SpliceImpactR
output:
    BiocStyle::html_document
vignette: >
    %\VignetteIndexEntry{SpliceImpactR}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
editor_options:
    chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5
)
library(SpliceImpactR)
library(BiocParallel)
```

# Overview
`SpliceImpactR` links differential splicing to protein-level outcomes,
including sequence changes, domain gain/loss, and potential PPI rewiring.

**SpliceImpactR** is a self-contained and efficient pipeline designed
to identify how alternative RNA processing events affect resulting
proteins. After initial alternative RNA processing quantification,
the method runs entirely within **R** and does not require additional
external software.

The pipeline begins by importing alternative RNA processing events
that differ across conditions. It is built to support highly flexible
input, requiring only inclusion and exclusion coordinates for each
event isoform. This makes it compatible with both user-defined events
and outputs from existing tools.

Custom input functions detect multiple classes of alternative RNA
processing, including alternative first exons (**AFE**), hybrid first
exons (**HFE**), retained introns (**RI**), skipped exons (**SE**),
alternative 5′ and 3′ splice sites (**A5SS** and **A3SS**), mutually
exclusive exons (**MXE**), hybrid last exons (**HLE**), and alternative
last exons (**ALE**).

SpliceImpactR then performs differential inclusion analysis to
identify condition-specific transcript pairs that differ by at least
one alternative RNA processing event. These events are mapped to
transcript and protein annotations to evaluate their effects on
coding potential, sequence similarity, and frameshifts.

The pipeline next uses curated and user-provided protein feature sets
to annotate domain-level changes, assess shifts in domain usage across
conditions, and integrate domain-domain interaction data,
domain-motif interaction data, and protein-protein interaction
information to predict downstream effects on protein interaction
networks.

Finally, SpliceImpactR performs a global analysis to detect
co-regulated splicing events and quantify the relative contribution
of each alternative RNA processing type. The pipeline is also
available through an **R Shiny** interface to improve accessibility
and ease of use.

SpliceImpactR is built to handle both `S4` `SpliceImpactResult` object with 
custom accession functions and standard `data.table` input / workflow.
`SpliceImpactResult` objects contain all relevant data along with detailed 
sequence information for each event. This object allows for maximal 
interoperability with any level of data from this pipeline.

This vignette uses bundled test data so every core step is runnable.

## Installation
Install with one of the following methods.

### Option 1: BiocManager
```{r install-biocmanager, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("SpliceImpactR")
```

### Option 2: devtools (GitHub)
```{r install-devtools, eval=FALSE}
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("fiszbein-lab/SpliceImpactR")
```

## External tools and acronyms
- `rMATS` (replicate Multivariate Analysis of Transcript Splicing) provides
  event-level splicing tables.
- `HITindex` quantifies first/last exon usage and labels hybrid exons.
- `PPIDM` Protein-Protein Interaction Domain Miner for domain-domain
  Interaction derived from PPI and 3did's domain-domain interactions
- `ELM` Eukaryotic Linear Motif Database for accessing Short Linear Motif
  occurences and domain-motif interactions
- `BiomaRt` accesses data from InterPro, PFAM, SignalP, TMHMM, CDD, Mobidb-lite

## Workflow map
1. Load reference resources (annotations + protein features).
2. Build sample manifest (`sample_frame`).
3. (Optional) run the quick-start wrapper for end-to-end execution.
4. Read splicing events for stepwise analysis.
5. Run QC summaries and plots.
6. Run differential inclusion.
7. Match DI events to transcript/protein sequences and build pairs.
8. Quantify sequence and domain impacts.
9. Infer PPI rewiring.
10. Build enrichment foreground gene sets from table or S4 inputs.
11. Visualize transcript and protein consequences for one event pair.
12. Generate integrated multi-layer summaries.
13. Use S4 container workflows and accessors.
14. Run optional custom-input entry points.

# Load Reference Resources
`get_protein_features()` supports these feature sources:
- `interpro`: integrated domain/family/superfamily signatures.
- `pfam`: protein domain HMM families.
- `cdd`: NCBI Conserved Domain Database annotations.
- `gene3d`: CATH/Gene3D structural domain annotations.
- `signalp`: signal peptide calls.
- `tmhmm`: transmembrane helix calls.
- `ncoils`: coiled-coil region predictions.
- `seg`: low-complexity segments.
- `mobidblite`: intrinsically disordered region calls.
- `elm`: short linear motifs (SLiMs; retrieved from ELM, not BioMart).

For BioMart-backed sources, each feature must provide three attributes:
`{feature}`, `{feature}_start`, and `{feature}_end` (for example
`pfam`, `pfam_start`, `pfam_end`). You can add non-BioMart custom features
with `get_manual_features()` (explained further down in the
Custom Input Entry Points section) and merge everything through
`get_comprehensive_annotations()`.

We load test annotations and three representative feature sources
(`interpro`, `signalp`, `elm`) and combine them. Loading multiple databases in
one call is possible, but not recommended due to timeouts. These are 
automatically cached and loaded from cache if the correct signature (organism,
version, etc) is found. BiomaRt can give difficulties in loading these, so this
procedure will sometimes take a few tries. If this is unsuccessful, restart R 
and/or try suggested methods related to this issue (eg: `httr::reset_config()`)

```{r load-resources}
library(SpliceImpactR)
annotation_df <- get_annotation(load = "test")

interpro_features <- get_protein_features(
  biomaRt_databases = c("interpro"),
  gtf_df = annotation_df$annotations,
  test = TRUE
)

signalp_features <- get_protein_features(
  biomaRt_databases = c("signalp"),
  gtf_df = annotation_df$annotations,
  test = TRUE
)

elm_features <- get_protein_features(
  biomaRt_databases = c("elm"),
  gtf_df = annotation_df$annotations,
  test = TRUE
)

protein_feature_total <- get_comprehensive_annotations(
  list(interpro_features, signalp_features, elm_features)
)

exon_features <- get_exon_features(
  annotation_df$annotations,
  protein_feature_total
)
```

```{r feature-summary}
table(protein_feature_total$database)
```

# Build Sample Manifest
Each sample directory contains event files exported by `rMATS` and/or
`HITindex`. The manifest below points to the bundled test directories. Using
this function, SpliceImpactR will search each directory for the outputs provided
by rMATS and HIT Index output files ending in `.AFEPSI`, `.ALEPSI`, and `.exon`.
Label one condition `case` and one condition `control`.

Required sample manifest columns:
- `path`: per-sample directory containing splicing output files.
- `sample_name`: unique sample identifier.
- `condition`: case/control label used in DI and downstream pairing.

Manifest expectations:
- One row per sample.
- `path` values must point to readable sample directories.
- Use exactly two conditions for this standard workflow (`case`, `control`).
- Include replicate samples per condition for robust differential modeling.

```{r load-data}
sample_frame <- data.frame(
  path = c(
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/control_S5/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/control_S6/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/control_S7/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/control_S8/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/case_S1/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/case_S2/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/case_S3/"),
    file.path(system.file("extdata", package = "SpliceImpactR"),
              "rawData/case_S4/")
  ),
  sample_name = c("S5", "S6", "S7", "S8", "S1", "S2", "S3", "S4"),
  condition = c(
    "control", "control", "control", "control",
    "case", "case", "case", "case"
  ),
  stringsAsFactors = FALSE
)
```

## Quick start wrapper (`get_splicing_impact`)
If your inputs already follow the standard layout, run the full workflow with
the top-level wrapper and choose table or S4 return mode.

```{r wrapper-quickstart, eval=FALSE}
# data.table-first return
out_dt <- get_splicing_impact(
  sample_frame = sample_frame,
  source_data = "both",
  annotation_df = annotation_df,
  protein_feature_total = protein_feature_total,
  exon_features = exon_features,
  return_class = "data.table"
)

# S4-first return (single object carries all slots)
out_s4 <- get_splicing_impact(
  sample_frame = sample_frame,
  source_data = "both",
  annotation_df = annotation_df,
  protein_feature_total = protein_feature_total,
  exon_features = exon_features,
  return_class = "S4"
)
```

# Read Splicing Events
For the stepwise workflow in this vignette, load event data explicitly:

```{r read-splicing-events}
data <- get_rmats_hit(
  sample_frame,
  event_types = c("ALE", "AFE", "MXE", "SE", "A3SS", "A5SS", "RI")
)

DT <- data.table::as.data.table(data)
DT[, .(
  n_rows = .N,
  n_events = data.table::uniqueN(event_id),
  n_genes = data.table::uniqueN(gene_id)
)]
```

If `rMATS` and `HITindex` outputs are stored in separate per-sample
directories, load them independently and combine on shared columns.
In this test-data example, both manifests point to the same bundled paths;
in real analyses, `sample_frame_rmats$path` and `sample_frame_hit$path`
can be different.

```{r load-data-separate}
sample_frame_rmats <- sample_frame
sample_frame_hit <- sample_frame

rmats_only <- get_rmats(
  load_rmats(
    sample_frame_rmats,
    use = "JCEC",
    event_types = c("MXE", "SE", "A3SS", "A5SS", "RI")
  )
)

hit_only <- get_hitindex(
  sample_frame_hit,
  keep_annotated_first_last = TRUE
)

shared_cols <- intersect(names(rmats_only), names(hit_only))
data_separate <- data.table::rbindlist(
  list(
    rmats_only[, ..shared_cols],
    hit_only[, ..shared_cols]
  ),
  use.names = TRUE,
  fill = TRUE
)

data_separate[, .(
  n_rows = .N,
  n_events = data.table::uniqueN(event_id),
  n_genes = data.table::uniqueN(gene_id)
)]
```

# Sample-Level QC and Exploration
## Compare HITindex Between Conditions
This summary compares event-level mean HIT values across conditions.

```{r hitindex-compare, fig.cap="HITindex condition comparison."}
hit_compare <- compare_hit_index(
  sample_frame,
  condition_map = c(control = "control", test = "case")
)

hit_compare$plot
```

```{r hitindex-table}
head(hit_compare$results[order(fdr)], 6)
```

## PSI Distribution Overview
This panel summarizes event count depth-normalized metrics and PSI eCDF.

```{r overview-psi, fig.cap="Depth-normalized event counts and PSI overview."}
overview_plot <- overview_spicing_comparison(
  events = data,
  sample_df = sample_frame,
  depth_norm = "exon_files",
  event_type = "AFE"
)

overview_plot
```

# Differential Inclusion (DI)
We run quasi-binomial GLMs and filter significant events.
Verbose filtering messages are line-wrapped so progress logs remain readable
without horizontal scrolling. For efficient processing, set `parallel_glm` to 
`TRUE` and provide a `BiocParallel` object to `BPPARAM`.

Default thresholds in `keep_sig_pairs` is FDR < 0.05 and |DELTA PSI| > 0.1.

```{r differential-inclusion}
res <- get_differential_inclusion(
  data,
  min_total_reads = 10,
  parallel_glm = TRUE,
  BPPARAM = BiocParallel::SerialParam()
)

res_di <- keep_sig_pairs(res)

res[, .(
  n_rows = .N,
  n_sig = sum(padj <= 0.05 & abs(delta_psi) >= 0.1, na.rm = TRUE)
)]
```

```{r di-volcano, fig.cap="Differential inclusion volcano plot."}
plot_di_volcano_dt(res)
```

# Match Events to Sequences and Build Transcript Pairs
Significant DI events are mapped to annotation and sequence data, then
collapsed into case/control transcript pairs.
This matching is done through a strict hierarchy:

1. Prefilter by `chr`, `strand`, and `gene_id` to keep only compatible
   annotation intervals.
2. Match inclusion (`inc`) coordinates to exons and require all inclusion
   parts to be covered for a transcript candidate.
3. Remove candidates that overlap exclusion (`exc`) coordinates.
4. Prioritize transcript/exon choices by event-type-consistent exon class
   (`first`, `internal`, `last`), then reciprocal overlap and intersection
   width; protein-linked transcripts are preferred when available.
5. Build case/control pairs in `get_pairs(source = "multi")` by joining all
   positive `delta_psi` rows (case) with all negative rows (control) for each
   `event_id`, then ordering by strongest `|delta_psi|`.

```{r matching-and-pairs}
matched <- get_matched_events_chunked(
  res_di,
  annotation_df$annotations,
  chunk_size = 2000
)

hits_sequences <- attach_sequences(matched, annotation_df$sequences)
pairs <- get_pairs(hits_sequences, source = "multi")

pairs[, .(
  n_pairs = .N,
  n_events = data.table::uniqueN(event_id)
)]
```

Once pairing is done, we can probe whether there is a proximal/distal shift
in AFE/ALE usage.

```{r proximal-distal, fig.cap="Proximal versus distal terminal exon summary."}
proximal_output <- get_proximal_shift_from_hits(pairs)
proximal_output$plot
```

# Sequence-Level Consequences
`compare_sequence_frame()` computes protein-coding class,
alignment identity, frameshift/rescue status, and length deltas.

Key labels used downstream:
- `protein_coding`: both isoforms have protein IDs (both sides coding).
- `onePC`: only one side has a protein ID.
- `noPC`: neither side has a protein ID.
- `Match`: protein sequences are identical.
- `FrameShift`: frame is disrupted between paired isoforms.

```{r sequence-compare}
seq_compare <- compare_sequence_frame(pairs, annotation_df$annotations)
```

```{r alignment-summary, fig.cap="Protein alignment summary by event type."}
alignment_plot <- plot_alignment_summary(seq_compare)
alignment_plot
```

```{r length-summary, fig.cap="Case versus control isoform length comparison."}
length_plot <- plot_length_comparison(seq_compare)
length_plot
```

# Domain-Level Changes and Domain Enrichment
Background transcript pairs are derived from annotations and compared
against observed event-linked domain changes. 

```{r domain-background-and-hits}
bg <- get_background(
  source = "annotated",
  annotations = annotation_df$annotations,
  protein_features = protein_feature_total,
  BPPARAM = BiocParallel::SerialParam()
)

hits_domain <- get_domains(seq_compare, exon_features)

hits_domain[, .(
  n_hits = .N,
  n_domain_changed = sum(diff_n > 0, na.rm = TRUE)
)]
```

```{r domain-enrichment, fig.cap="Top enriched InterPro domain differences."}
enriched_domains <- enrich_domains_hypergeo(
  hits = hits_domain,
  background = bg,
  db_filter = "interpro"
)

domain_plot <- plot_enriched_domains_counts(enriched_domains, top_n = 20)
domain_plot
```

You can also stratify enrichment by event type or by feature database.

```{r domain-enrichment-by-event-db}
enriched_afe_interpro <- enrich_by_event(
  hits = hits_domain,
  background = bg,
  events = "AFE",
  db_filter = "interpro"
)

enriched_interpro_only <- enrich_by_db(
  hits = hits_domain,
  background = bg,
  dbs = "interpro"
)

list(
  by_event_rows = nrow(enriched_afe_interpro),
  by_db_rows = nrow(enriched_interpro_only)
)
```

# PPI Rewiring
PPI effects are inferred from domain and motif changes. Domain-domain
and domain-motif interactions are sourced from 3did/PPDIM and ELM. PPI rewiring
currently only works for human inputs.

```{r ppi-effects, fig.cap="Summary of predicted case/control PPI changes."}
ppi <- get_ppi_interactions()

hits_final <- get_ppi_switches(
  hits_domain,
  ppi,
  protein_feature_total
)

ppi_plot <- plot_ppi_summary(hits_final)
ppi_plot
```

# Gene Enrichment Foreground Probes
`get_gene_enrichment()` extracts foreground genes for enrichment.

- `mode = "di"` expects a DI-like table (`res`).
- `mode = "ppi"`/`"domain"` expects a hits-like table (`hits_final` or
  `hits_domain`).
- S4 examples are shown in Section 13 so object setup stays in one place.

```{r enrichment-foregrounds}
fg_di <- get_gene_enrichment(
  mode = "di",
  res = res,
  padj_threshold = 0.1,
  delta_psi_threshold = 0.1
)

fg_domain <- get_gene_enrichment(mode = "domain", hits = hits_domain)
fg_ppi <- get_gene_enrichment(mode = "ppi", hits = hits_final)

lengths(list(di = fg_di, domain = fg_domain, ppi = fg_ppi))
intersect(fg_di, fg_ppi)
```

The optional `get_enrichment()` step depends on additional packages and may
return no significant terms on small toy datasets. If you see all `NA`
statistics, foreground genes often map too sparsely into the selected
background and ontology. `p_adjust_cutoff` and `top_n_plot` are set 
to allow for example visualization, but should be adjusted to standard
(eg: 0.05, 10) for non-toy data uses.

```{r run-get-enrichment}
needed <- c("clusterProfiler", "msigdbr", "org.Hs.eg.db")
has_needed <- all(vapply(
  needed,
  requireNamespace,
  FUN.VALUE = logical(1),
  quietly = TRUE
))

if (has_needed) {
  enrichment_di <- get_enrichment(
    foreground = fg_domain,
    background = bg$gene_id,
    species = "human",
    gene_id_type = "ensembl",
    sources = "GO:BP",
    p_adjust_cutoff = 1, ## CHANGE FOR NON-TOY DATA USAGE, eg: 0.05
    top_n_plot = 1 ## CHANGE FOR NON-TOY DATA USAGE, eg: NULL or 10
  )

  enrichment_di$plot
} else {
  message(
    "Skipping get_enrichment(): install ",
    paste(needed, collapse = ", ")
  )
}
```

# Visualize Transcript and Protein Consequences
`plot_two_transcripts_with_domains_unified()` shows matched isoforms in either
transcript (genome-coordinate) or protein (amino-acid) view.

```{r transcript-protein-views, fig.cap="Transcript and protein views for one paired event."}
viz_pair <- hits_final[
  !is.na(transcript_id_case) &
    !is.na(transcript_id_control) &
    transcript_id_case != "" &
    transcript_id_control != ""
][1]

if (nrow(viz_pair) == 0) {
  stop("No transcript pairs available for visualization.")
}

tx_pair <- c(viz_pair$transcript_id_case, viz_pair$transcript_id_control)

transcript_centric <- plot_two_transcripts_with_domains_unified(
  transcripts = tx_pair,
  gtf_df = annotation_df$annotations,
  protein_features = protein_feature_total,
  feature_db = c("interpro"),
  combine_domains = TRUE,
  view = "transcript"
)

protein_centric <- plot_two_transcripts_with_domains_unified(
  transcripts = tx_pair,
  gtf_df = annotation_df$annotations,
  protein_features = protein_feature_total,
  feature_db = c("interpro"),
  combine_domains = TRUE,
  view = "protein"
)

transcript_centric
protein_centric
```

# Inspect a Single Event
`probe_individual_event()` plots PSI for one event across samples.

```{r probe-event, fig.cap="PSI distribution for one selected event."}
event_to_probe <- res$event_id[1]
probe <- probe_individual_event(data, event = event_to_probe)
probe$plot
```

```{r probe-table}
head(probe$data)
```

# Integrative Visualization
`integrated_event_summary()` combines sequence, domain, and PPI summaries.

Panel guide:
- Top-left: event classification composition (`Match`, `FrameShift`, etc.) by
  event type.
- Top-right: alignment score distributions by event type.
- Middle-left: domain-change prevalence (`Any`, case-only, control-only, both).
- Middle-center/right: fraction and counts of predicted PPI rewiring.
- Bottom-left: relative event retention from pre-filter DI to final hits.
- Bottom-right: gene-level event-type coordination (Jaccard heatmap).

```{r integrated-summary, fig.cap="Integrated multi-layer summary.", fig.width=14, fig.height=10}
int_summary <- integrated_event_summary(hits_final, res)
int_summary$plot
```

```{r integrated-summary-table}
int_summary$summaries$relative_use
```

# Output Description (`hits_final`, `data`, `res`)
The pipeline returns three core tables in `data.table` mode:
- `data`: sample-level event measurements before differential modeling.
- `res`: differential inclusion results per tested event/site.
- `hits_final`: paired case/control isoform effects with sequence, domain, and
  PPI annotations.

Suffix convention used throughout:
- `_case` = case-preferred isoform values.
- `_control` = control-preferred isoform values.

## `hits_final` (integrated event-level output)
Use this table for biological interpretation and downstream plotting.

1. Event and isoform identifiers:
`event_id`, `event_type`, `gene_id`, `chr`, `strand`,
`transcript_id_case`, `transcript_id_control`,
`protein_id_case`, `protein_id_control`, `form_case`, `form_control`,
`exons_case`, `exons_control`.
2. Event coordinates and DI statistics:
`inc_case`, `inc_control`, `exc_case`, `exc_control`,
`delta_psi_case`, `delta_psi_control`,
`p.value_case`, `p.value_control`, `padj_case`, `padj_control`,
`n_samples_case`, `n_samples_control`,
`n_case_case`, `n_case_control`, `n_control_case`, `n_control_control`.
3. Sequence and coding context:
`transcript_seq_case`, `transcript_seq_control`,
`protein_seq_case`, `protein_seq_control`, `pc_class`,
length columns (`prot_len_*`, `tx_len_*`, `exon_cds_len_*`, `exon_len_*`)
and their `*_diff` / `*_diff_abs` derivatives.
4. Alignment/frame classification:
DNA alignment (`dna_pid`, `dna_score`, `dna_width`),
protein alignment (`prot_pid`, `prot_score`, `prot_width`),
frame diagnostics (`frame_call`, `rescue`, `frame_check_exon1`,
`frame_check_exon2`), and `summary_classification`.
5. Domain-level change annotations:
`domains_exons_case`, `domains_exons_control`,
`case_only_domains`, `control_only_domains`,
`case_only_domains_list`, `control_only_domains_list`,
`either_domains_list`, `case_only_n`, `control_only_n`, `diff_n`.
6. Predicted interaction rewiring:
`case_ppi`, `control_ppi`, `n_case_ppi`, `n_control_ppi`, `n_ppi`,
`case_ppi_drivers`, `control_ppi_drivers`.

## `data` (raw sample-level input table)
Use `data` to inspect per-sample evidence feeding DI.

Core columns:
`event_id`, `event_type`, `form`, `gene_id`, `chr`, `strand`, `inc`, `exc`,
`inclusion_reads`, `exclusion_reads`, `psi`, `sample`, `condition`,
`source_file`.

Often present depending on import path:
HITindex metadata such as `HITindex`, `class`, `nFE`, `nLE`, `nUP`, `nDOWN`,
`nTXPT`, `psi_original`, `total_reads`, `source`.

## `res` (differential inclusion output)
Use `res` to rank significant events before pairing/domain/PPI steps.

Core columns:
`site_id`, `event_id`, `event_type`, `gene_id`, `chr`, `strand`, `inc`, `exc`,
`form`, `n_samples`, `n_control`, `n_case`, `mean_psi_ctrl`, `mean_psi_case`,
`delta_psi`, `p.value`, `padj`, `cooks_max`, `n`, `n_used`.

# S4 Applications and Accessors
Use `SpliceImpactResult` when you want all pipeline pieces in one object,
with consistent filtering across event-linked slots.

`SpliceImpactResult` is a custom S4 container that keeps all major pipeline
parts synchronized:
- `raw_events` (`SummarizedExperiment`): sample-level table + ranges/assays.
- `di_events` / `res_di` (`GRanges`): differential inclusion rows.
- `matched` (`DFrame`): annotation-matched rows (and sequence-attached rows).
- `paired_hits` (`GRanges`) + `segments` (`GRangesList`): final case/control
  event impacts.
- `sample_frame` (`DFrame`): sample manifest metadata.

```{r s4-init}
obj <- as_splice_impact_result(
  data = data,
  res = res,
  res_di = res_di,
  matched = hits_sequences,
  sample_frame = sample_frame,
  hits_final = hits_final
)
```

`get_hits_*()` wrappers provide compact views for common downstream tasks:
- `get_hits_core()`: event IDs, transcript/protein IDs, and key DI metadata.
- `get_hits_domain()`: domain gain/loss columns (`case_only_*`, `control_only_*`,
  `diff_n`, mapped domain lists).
- `get_hits_ppi()`: PPI partner changes (`n_ppi`, case/control partner lists,
  PPI drivers).
- `get_hits_sequence()`: sequence/alignment/frame fields (`pc_class`,
  `frame_call`, protein/transcript length deltas, identity metrics).

```{r s4-accessors}
raw_dt <- as_dt_from_s4(obj, "raw_events")
di_dt <- as_dt_from_s4(obj, "di_events")
hits_dt <- as_dt_from_s4(obj, "paired_hits")

hits_core <- get_hits_core(obj)
hits_domain_view <- get_hits_domain(obj)
hits_ppi_view <- get_hits_ppi(obj)
hits_sequence_view <- get_hits_sequence(obj)

c(
  raw_rows = nrow(raw_dt),
  di_rows = nrow(di_dt),
  hits_rows = nrow(hits_dt),
  hits_core_rows = nrow(hits_core),
  hits_domain_rows = nrow(hits_domain_view),
  hits_ppi_rows = nrow(hits_ppi_view),
  hits_sequence_rows = nrow(hits_sequence_view)
)
```

```{r s4-filtering}
obj_focus <- filter_spliceimpact_hits(
  obj,
  diff_n > 0,
  n_ppi > 0
)

focus_hits <- as_dt_from_s4(obj_focus, "paired_hits")
focus_hits[, .N, by = event_type][order(-N)]
```

```{r s4-enrichment}
fg_di_s4 <- get_gene_enrichment(
  mode = "di",
  x = obj,
  padj_threshold = 0.05,
  delta_psi_threshold = 0.1
)

fg_ppi_s4 <- get_gene_enrichment(mode = "ppi", x = obj)

fg_ppi_focus <- get_gene_enrichment(mode = "ppi", x = obj_focus)
list(
  di_equal_table = identical(sort(fg_di), sort(fg_di_s4)),
  ppi_equal_table = identical(sort(fg_ppi), sort(fg_ppi_s4)),
  focused_ppi_genes = head(fg_ppi_focus)
)
```

```{r s4-schema}
spliceimpact_s4_schema()
```

## S4-first Main Workflow (Code Only)
This mirrors the table workflow, but keeps everything in a single
`SpliceImpactResult` object.

```{r s4-main-workflow-code-only, eval=FALSE}
obj_flow <- as_splice_impact_result(
  data = data,
  sample_frame = sample_frame
)

obj_flow <- get_differential_inclusion(
  obj_flow,
  min_total_reads = 10,
  parallel_glm = TRUE,
  BPPARAM = BiocParallel::SerialParam(),
  return_class = "S4"
)
obj_flow <- keep_sig_pairs(obj_flow, return_class = "S4")

obj_flow <- get_matched_events_chunked(
  obj_flow,
  annotation_df$annotations,
  chunk_size = 2000,
  return_class = "S4"
)
obj_flow <- attach_sequences(
  obj_flow,
  annotation_df$sequences,
  return_class = "S4"
)
obj_flow <- get_pairs(obj_flow, source = "multi", return_class = "S4")
obj_flow <- compare_sequence_frame(
  obj_flow,
  annotation_df$annotations,
  return_class = "S4"
)
obj_flow <- get_domains(obj_flow, exon_features, return_class = "S4")
obj_flow <- get_ppi_switches(
  obj_flow,
  ppi,
  protein_feature_total,
  return_class = "S4"
)

hits_core_flow <- get_hits_core(obj_flow)
hits_domain_flow <- get_hits_domain(obj_flow)
hits_ppi_flow <- get_hits_ppi(obj_flow)
hits_sequence_flow <- get_hits_sequence(obj_flow)
```

# Custom Input Entry Points
These helpers support non-standard inputs at multiple stages of the workflow.

## Add Manual Protein Features
`get_manual_features()` lets you add user-defined peptide-level features.

```{r custom-manual-features}
ann_dt <- data.table::as.data.table(annotation_df$annotations)
coding_tx <- unique(ann_dt[type == "exon" & cds_has == TRUE, transcript_id])
n_manual <- min(3L, length(coding_tx))

if (n_manual < 1L) {
  stop("No coding transcripts found for manual feature example.")
}

manual_df <- data.frame(
  ensembl_transcript_id = coding_tx[seq_len(n_manual)],
  ensembl_peptide_id = rep(NA_character_, n_manual),
  name = paste0("demo_feature_", seq_len(n_manual)),
  start = c(20L, 45L, 80L)[seq_len(n_manual)],
  stop = c(35L, 58L, 92L)[seq_len(n_manual)],
  database = rep("manual", n_manual),
  alt_name = rep(NA_character_, n_manual),
  feature_id = rep(NA_character_, n_manual),
  stringsAsFactors = FALSE
)

manual_features <- get_manual_features(
  manual_features = manual_df,
  gtf_df = annotation_df$annotations
)

head(manual_features)
```

## Pre-DI Custom Table
Use `get_user_data()` when you already have sample-level reads/PSI.

```{r custom-pre-di}
example_df <- data.frame(
  event_id = rep("A3SS:1", 8),
  event_type = "A3SS",
  form = rep(c("INC", "EXC"), each = 4),
  gene_id = "ENSG00000158286",
  chr = "chrX",
  strand = "-",
  inc = c(rep("149608626-149608834", 4), rep("149608626-149608829", 4)),
  exc = c(rep("", 4), rep("149608830-149608834", 4)),
  inclusion_reads = c(30, 32, 29, 31, 2, 3, 4, 3),
  exclusion_reads = c(1, 1, 2, 1, 28, 27, 26, 30),
  sample = c("S1", "S2", "S3", "S4", "S1", "S2", "S3", "S4"),
  condition = rep(c("case", "case", "control", "control"), 2),
  stringsAsFactors = FALSE
)
example_df$psi <- example_df$inclusion_reads /
  (example_df$inclusion_reads + example_df$exclusion_reads)

user_data <- get_user_data(example_df)
head(user_data)
```

## Post-DI Custom Table
Use `get_user_data_post_di()` when you already have event-level DI summaries.

```{r custom-post-di}
example_user_data <- data.frame(
  event_id = rep("A3SS:1", 8),
  event_type = "A3SS",
  gene_id = "ENSG00000158286",
  chr = "chrX",
  strand = "-",
  form = rep(c("INC", "EXC"), each = 4),
  inc = c(rep("149608626-149608834", 4), rep("149608626-149608829", 4)),
  exc = c(rep("", 4), rep("149608830-149608834", 4)),
  stringsAsFactors = FALSE
)

user_res <- get_user_data_post_di(example_user_data)
head(user_res)
```

## Post-DI rMATS Import
Use `get_rmats_post_di()` to ingest already-computed rMATS result tables.

```{r custom-rmats-post-di}
rmats_df <- data.frame(
  ID = 1L,
  GeneID = "ENSG00000182871",
  geneSymbol = "COL18A1",
  chr = "chr21",
  strand = "+",
  longExonStart_0base = 45505834L,
  longExonEnd = 45505966L,
  shortES = 45505837L,
  shortEE = 45505966L,
  flankingES = 45505357L,
  flankingEE = 45505431L,
  ID.2 = 2L,
  IJC_SAMPLE_1 = "1,1,1",
  SJC_SAMPLE_1 = "1,1,1",
  IJC_SAMPLE_2 = "1,1,1",
  SJC_SAMPLE_2 = "1,1,1",
  IncFormLen = 52L,
  SkipFormLen = 49L,
  PValue = 0.6967562,
  FDR = 1,
  IncLevel1 = "0.0,0.0,0.0",
  IncLevel2 = "1.0,1.0,1.0",
  IncLevelDifference = 1.0,
  stringsAsFactors = FALSE
)

user_res_rmats <- get_rmats_post_di(rmats_df, event_type = "A3SS")
head(user_res_rmats)
```

## Transcript-Pair Entry Point
Use `compare_transcript_pairs()` when you want to start directly from chosen
transcript pairs instead of event matching.

```{r custom-transcript-pairs}
tx_ids <- unique(annotation_df$annotations$transcript_id)
tx_ids <- tx_ids[!is.na(tx_ids) & tx_ids != ""]

if (length(tx_ids) < 4L) {
  stop("Need at least four transcripts for transcript-pair example.")
}

transcript_pairs <- data.frame(
  transcript1 = tx_ids[1:2],
  transcript2 = tx_ids[3:4],
  stringsAsFactors = FALSE
)

user_matched <- compare_transcript_pairs(
  transcript_pairs = transcript_pairs,
  annotations = annotation_df$annotations
)

head(user_matched)
```

# Session Info
```{r}
sessionInfo()
```