OutSplice

Introduction

OutSplice is a package that identifies aberrant splicing events in tumor samples when compared to normal samples using outlier statistics to better characterize sporadic events occurring in heterogenous cancer samples. Events are classified as skipping, insertion, or deletion events. Additionally, outlier expression of each splicing event is called per sample allowing for evaluation of splicing burden of individual samples. This package is specifically designed to analyze RNA sequencing data provided by the user; additional functions have also been added for direct analysis from the Cancer Genome Atlas (TCGA). Packages in Bioconductor, such as the psichomics package, perform similar analyses on TCGA or user-generated data, however they are not specifically designed to identify cancer specific events. Because of the heterogeneity of tumors in the context of cancer biology, outlier statistics rather than mean comparisons are used to identify aberrant events, where outliers are classified on a per sample basis; this allows the algorithm to discern splicing events with outlier expression that may occur within a subset of tumor samples when compared to normal conditions. Therefore, this package generates a matrix of splicing outliers, which are associated with splice junctions that are either significantly over or under-expressed compared to the distribution of expression in normal tissue. The FRASER package also offers functionality to determine splicing outliers, however, OutSplice utilizes a rank sum outlier approach instead of a beta binomial model. This allows for users to eliminate less biologically relevant events by setting a minimum level of event expression. Additionally, OutSplice has an easy to use work-flow that makes it accessible for researchers of varying experience with bioinformatics tools. With OutSplice, the junction normalization, filtering, outlier calling, and statistical analysis are done in one function call, simplifying this process for the user. OutSplice is also capable of automatically calculating the splicing burden in each of the user’s samples and will provide waterfall plots of expression to help visualize individual splicing events. Overall, this package is novel in that it determines differential splicing burdens between tumors and normal samples and characterizes the nature of splicing outliers in a quick and user friendly fashion.

Functionality

The main functions of OutSplice achieve the following for either user provided data or data provided from the TCGA.

  1. Junction normalization
  2. Outlier analysis
  3. Determination of a junctional outlier as a skipping, insertion, or deletion
  4. Calculation of splicing burden
  5. Plotting expression levels

Minimum required packages

OutSplice will import the below packages and their dependencies so please ensure they are installed before using the software:

  1. AnnotationDbi
  2. GenomicRanges
  3. GenomicFeatures
  4. IRanges
  5. org.Hs.eg.db
  6. Repitools
  7. TxDb.Hsapiens.UCSC.hg19.knownGene
  8. TxDb.Hsapiens.UCSC.hg38.knownGene
  9. S4Vectors

Installation from Bioconductor

The OutSplice package is available at https://bioconductor.org and can be installed via BiocManager::install:

A package only needs to be installed once. Load the package into an R session with:

library(OutSplice)

Inputs

outspliceAnalysis/outspliceTCGA

Full Sample data for OutSplice can be downloaded from The Cancer Genome Atlas (TCGA).

The outspliceAnalysis function has 4 required inputs and 13 optional Inputs. The outspliceTCGA function has 3 required inputs and 9 optional Inputs and should only be used if the data is in TCGA format.

In the below examples we run outspliceAnalysis and outspliceTCGA on a subset of Head and Neck squamous cell carcinoma data obtained from The Broad Institute TCGA GDAC Firehose database.

junction <- system.file("extdata", "HNSC_junctions.txt.gz", package = "OutSplice")
gene_expr <- system.file("extdata", "HNSC_genes_normalized.txt.gz", package = "OutSplice")
rawcounts <- system.file("extdata", "Total_Rawcounts.txt", package = "OutSplice")
sample_labels <- system.file("extdata", "HNSC_pheno_table.txt", package = "OutSplice")
output_file_prefix <- "OutSplice_Example"
TxDb_hg19 <- "TxDb.Hsapiens.UCSC.hg19.knownGene"
dir <- paste0(tempdir(), "/")
message("Output is located at: ", dir)
## Output is located at: /tmp/RtmpdCxhCY/
results <- outspliceAnalysis(junction, gene_expr, rawcounts, sample_labels, saveOutput = TRUE, output_file_prefix, dir, filterSex = TRUE, annotation = "org.Hs.eg.db", TxDb = TxDb_hg19, offsets_value = 0.00001, correction_setting = "fdr", p_value = 0.05)
## Loading data
## pheno
## Normal  Tumor 
##     44     50
## convert to RPM
## filter the putative junctions
## run the ogsa function for pre filtering
## get the genomic information for all the junctions
##   403 genes were dropped because they have exons located on both strands of the
##   same reference sequence or on more than one reference sequence, so cannot be
##   represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
##   object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:1 mapping between keys and columns
## filtering for putative junctions, skipping, insertion and deletion events based on known exons.
## skipping
## insertions
## deletions
## remove all that map to 'NA' no gene name, and assign gene expression from gene_expr
## align with gene_expr data
## subset removing any genes without normalization
## Perform normalization using gene_expr values
## run the ogsa function
junction <- system.file("extdata", "TCGA_HNSC_junctions.txt.gz", package = "OutSplice")
gene_expr <- system.file("extdata", "TCGA_HNSC_genes_normalized.txt.gz", package = "OutSplice")
rawcounts <- system.file("extdata", "Total_Rawcounts.txt", package = "OutSplice")
output_file_prefix <- "TCGA_OutSplice_Example"
dir <- paste0(tempdir(), "/")
message("Output is located at: ", dir)
## Output is located at: /tmp/RtmpdCxhCY/
results_TCGA <- outspliceTCGA(junction, gene_expr, rawcounts, saveOutput = TRUE, output_file_prefix, dir, filterSex = TRUE, annotation = "org.Hs.eg.db", TxDb = "TxDb.Hsapiens.UCSC.hg19.knownGene", offsets_value = 0.00001, correction_setting = "fdr", p_value = 0.05)
## Loading data
## pheno
## Normal  Tumor 
##     44     50
## convert to RPM
## filter the putative junctions
## run the ogsa function for pre filtering
## get the genomic information for all the junctions
##   403 genes were dropped because they have exons located on both strands of the
##   same reference sequence or on more than one reference sequence, so cannot be
##   represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList
##   object, or use suppressMessages() to suppress this message.
## 'select()' returned 1:1 mapping between keys and columns
## filtering for putative junctions, skipping, insertion and deletion events based on known exons.
## skipping
## insertions
## deletions
## remove all that map to 'NA' no gene name, and assign gene expression from gene_expr
## align with gene_expr data
## subset removing any genes without normalization
## Perform normalization using gene_expr values
## run the ogsa function

-junction: filepath to a matrix of junction raw read counts

-gene_expr: filepath to a matrix of normalized gene expression data. RSEM quartile normalized data is recommended.

-rawcounts: filepath to a matrix of raw read counts for each gene. Can either be per gene, or a summed total for each sample.

-sample_labels: filepath to a matrix of phenotypes (in Section 2.2)

-saveOutput: saves results output to a user specified directory. Default is FALSE [OPTIONAL]

-output_file_prefix: user defined string for what the prefix of the output file should be named. Default is NULL [OPTIONAL]

-dir: string containing the desired output path. Ex) “~/my_outsplice_output/”. Default is NULL [OPTIONAL]

-filtersex: ignores sex chromosomes when generating results. Default is TRUE [OPTIONAL]

-annotation: the bioconductor package containing the annotations to use. Uses the human genome by default (in Section 3.4) [OPTIONAL]

-TxDb: the bioconductor package containing the transcript annotations to use. The outspliceAnalysis function uses the hg38 genome by default, and outspliceTCGA uses hg19 by default (in Section 3.4) [OPTIONAL]

-offsets_value: the normalized junction expression threshold. Uses 0.00001 by Default (in Section 3.6) [OPTIONAL]

-correction_setting: the correction value to be used for p-value adjustment during Fisher analyses. The available options are: “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, and “none”. Uses ‘fdr’ by Default. [OPTIONAL]

-p_value: the significance threshold to use during Fisher analyses. Uses 0.05 by default. [OPTIONAL]

-use_junc_col: an integer indicating which column in the junction matrix contains the junction regions in your matricies. Uses the first column (1) by default. outSpliceAnalysis only. [OPTIONAL]

-use_gene_col: which column in the gene_expr matrix contains the entrez ids of your genes. Uses the first column (1) by default. outSpliceAnalysis only. [OPTIONAL]

-use_rc_col: which column in the rawcounts matrix contains the row names. Uses the first column (1) by default. outSpliceAnalysis only. [OPTIONAL]

-use_labels_col: which column in the sample_labels matrix contains the sample names. Uses the first column (1) by default. outSpliceAnalysis only. [OPTIONAL]

This algorithm is compatible with any organism provided the genome object, genome wide annotation, and transcript annotations exist as packages on Bioconductor.

Phenotype matrix

If using the outspliceAnalysis function, users must provide a phenotype matrix, designating which samples in the junction file belong to the tumor group (labeled as “T”) and the normal group (labeled as “F”). Please ensure the matrix file contains a header row with the first column designating the sample names, and the second column designating the phenotype. If using TCGA data, the two phenotypes are “tumor” and “normal.” outspliceTCGA can automatically infer the phenotype of TCGA data using the sample names. Only if the phenotype matrix has more than 10 control samples, the program proceeds.

plotJunctionData

After running outspliceAnalysis or outspliceTCGA, the user can create plots of specified junctions using the plotJunctionData function.

The function plotJunctionData has 1 required input and 10 optional inputs

In the below example we can plot the expression levels for a “skipping” junction event on the ECM1 gene based on our example data from the TCGA.

data_file <- system.file("extdata", "OutSplice_Example_2023-01-06.RDa", package = "OutSplice")
ecm1_junc <- "chr1:150483674-150483933"
pdf <- "ecm1_expression.pdf"
pdf_output <- paste0(tempdir(), "/", pdf)
message("Output is located at: ", pdf_output)
## Output is located at: /tmp/RtmpdCxhCY/ecm1_expression.pdf
plotJunctionData(data_file, NUMBER = 1, junctions = ecm1_junc, tail = NULL, p_value = 0.05, GENE = FALSE, SYMBOL = NULL, makepdf = TRUE, pdffile = pdf_output, tumcol = "red", normcol = "blue")
## chr1:150483674-150483933

-data_file: filepath to an R Data file containing output from the outspliceAnalysis or outspliceTCGA functions.

-NUMBER: number of junctions to plot. This can be top number of junctions (over or under expressed), or can be specific junctions in a list. Default is 1 [OPTIONAL]

-junctions: you can input the specific junction you want to graph (or vector of junctions). Default is NULL [OPTIONAL]

-tail: you can specify if you want top over or under expressed with tail=‘RIGHT’ for junctions overexpressed in tumors, or tail=‘LEFT’ for junctions underexpressed in tumors. Default is NULL [OPTIONAL]

-p_value: p-value threshold to use when plotting the top over or under-expressed junctions with “tail”. Default is 0.05 [OPTIONAL]

-GENE: Pick junctions based on a specific gene. TRUE means you will pick all the junctions mapping to a certain gene. FALSE means you do not pick based on the gene. Default is False. [OPTIONAL]

-SYMBOL: HGNC SYMBOL of gene you want to graph [OPTIONAL]

-makepdf: Save graphs to a pdf? Default is no/FALSE [OPTIONAL]

-pdffile: if you want to save to a pdf, you will need to write the file path [OPTIONAL when makepdf = F]

-tumcol: color for tumors on graph. Default is red [OPTIONAL]

-normcol: color for normals on graph. Default is blue [OPTIONAL]

Methodology

The below sections describe the processes used in the above functions.

Junction RPM normalization

The program automatically normalizes the junction counts by dividing the junction counts by the total raw counts and then dividing each count by 10^6 to generate RPM junction data.

OGSA initial filtering

The dotheogsa function from the Bioconductor package OGSA is sourced to remove junctions that may not be biologically relevant due to low expression or that have any difference between tumor and normal. In this package, we set a 0.1 RPM expression threshold for pre-filtering.

OGSA outlier analysis

The dotheogsa function is again employed to determine splicing events as outliers, which are defined as any normalized junctions that are two standard deviations above or below the median of the normal distribution. A Fisher exact test is used to determine which junctions are significantly over or under expressed in tumors compared to the normal samples.

Genomic references

The Bioconductor GenomicRanges packages are used to assign each junction to a known gene. The user has the option in the main function to input which genome and its associated Bioconductor packages to use as the reference. These Bioconductor packages should include the genome object, annotations, and transcript annotations.

Ex) For mouse genomes aligned to mm39, install and load: Mus.musculus (genome object), org.Mm.eg.db (annotations), and TxDb.Mmusculus.UCSC.mm39.refGene (transcript annotations) using the library() function. Then, when using the OutSplice functions, specify org.Mm.eg.db for the annotation argument, and TxDb.Mmusculus.UCSC.mm39.refGene for the TxDb argument.

This process will also work for any genome provided the correct annotation package, its corresponding transcript annotation package, and their dependencies from Bioconductor are provided and loaded.

Using this genomic assignment, the dotheogsa function determines insertion, skipping, or deletion events based on the following criteria:

insertion: junction that starts outside a known exon

skipping: junction that skips over a known exon

deletion: junction that is inside a known exon but not as its start or end

Junction expression normalization

Junction expression is normalized based on its corresponding gene expression from the gene_expr input. This is achieved by dividing the junction RPM data by the normalized gene expression counts from a junction’s corresponding gene. If a junction is aligned to more than one gene, then the gene with the lower entrez id will be the one selected for the normalization.

Filter by expression via offsets

Offsets, which the user can specify, sets a minimum value relative to the normal samples in order to call a junction an outlier. The goal is to remove data with low expression that may not be biologically relevant. In this example example, an outlier junction must have a normalized expression greater than 0.00001 in order to be called an outlier. Any outliers with expressions below this value are too low to be relevant for the analysis in this example.

Splice Burden Calculation

Sums the number of splicing events in each sample that were marked as a TRUE outlier for both over-expressed and under-expressed events. The total number of outliers is then calculated as the sum of the over and under-expressed outliers.

Junction Plotting

Creates bar and waterfall plots of junction expression in both the tumor and normal samples. The data for these plots comes from the raw junction input, the gene expression values to reflect overall gene expression, and the junction expression normalized by gene expression.

Outputs

outSpliceAnalysis/outSpliceTCGA

Returns a list and, if specified, an R data file and tab deliminated text files with the following data:

-ASE.type: significant junction events labeled by type (skipping, insertion, or deletion)

-FisherAnalyses: Data Frame of junction events containing the number of under/over-expressed outliers in the tumor group (Num_UE_Outliers/Num_OE_Outliers), the Fisher p-value for under/over-expressed events (FisherP1/FisherP2), and a ranking of the under/over expressed events (UE_Rank/OE_Rank)

-geneAnnotations: object containing gene names corresponding to each junction region

junc.Outliers: list containing the logical matrices TumorOverExpression and TumorUnderExpression. “True” indicates an over-expressed event in TumorOverExpression, or an under-expressed event in TumorUnderExpression.

-junc.RPM: junction counts in reads per million following a division of the junction counts input by the total rawcounts for each sample

-junc.RPM.norm: junction counts normalized by each event’s total gene expression value

-gene_expr: gene expression values for each junction event

-splice_burden: matrix containing the number of Fisher-P significant over-expressed, under-expressed, and total number of outliers per sample

-NORM.gene_expr.norm: Median of junction data normalized by gene expression for normal samples only (Used for Junction Plotting Only)

-pheno: Phenotypes of Samples (Tumor or Normal)

-pvalues: Junction Fisher P-values

If file output was specified, the files are:

-Data File: _.RDa

-ASE.type: event_types.txt

-FisherAnalyses: FisherAnalyses.txt

-geneAnnotations: gene_annotations.txt

-junc.Outliers: TumorOverExpression.txt, TumorUnderExpression.txt

-splice_burden: splice_burden.txt

plotJunctionData

Outputs junction expression plots of user specified junctions as defined in Section 3.8. Plots can be saved to a user defined pdf file.

References

Broad Institute TCGA Genome Data Analysis Center (2016): Firehose stddata__2016_01_28 run. Broad Institute of MIT and Harvard. doi:10.7908/C11G0KM9

Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature. 2015 Jan 29;517(7536):576-82. doi: 10.1038/nature14129. PMID: 25631445; PMCID: PMC4311405.

Guo T, Sakai A, Afsari B, Considine M, Danilova L, Favorov AV, Yegnasubramanian S, Kelley DZ, Flam E, Ha PK, Khan Z, Wheelan SJ, Gutkind JS, Fertig EJ, Gaykalova DA, Califano J. A Novel Functional Splice Variant of AKT3 Defined by Analysis of Alternative Splice Expression in HPV-Positive Oropharyngeal Cancers. Cancer Res. 2017 Oct 1;77(19):5248-5258. doi: 10.1158/0008-5472.CAN-16-3106. Epub 2017 Jul 21. PMID: 28733453; PMCID: PMC6042297.

Liu C, Guo T, Sakai A, Ren S, Fukusumi T, Ando M, Sadat S, Saito Y, Califano JA. A novel splice variant of LOXL2 promotes progression of human papillomavirus-negative head and neck squamous cell carcinoma. Cancer. 2020 Feb 15;126(4):737-748. doi: 10.1002/cncr.32610. Epub 2019 Nov 13. PMID: 31721164.

Liu C, Guo T, Xu G, Sakai A, Ren S, Fukusumi T, Ando M, Sadat S, Saito Y, Khan Z, Fisch KM, Califano J. Characterization of Alternative Splicing Events in HPV-Negative Head and Neck Squamous Cell Carcinoma Identifies an Oncogenic DOCK5 Variant. Clin Cancer Res. 2018 Oct 15;24(20):5123-5132. doi: 10.1158/1078-0432.CCR-18-0752. Epub 2018 Jun 26. PMID: 29945995; PMCID: PMC6440699.

M. F. Ochs, J. E. Farrar, M. Considine, Y. Wei, S. Meshinchi, and R. J. Arceci. Outlier analysis and top scoring pair for integrated data analysis and biomarker discovery. IEEE/ACM Trans Comput Biol Bioinform, 11: 520-32, 2014. PMCID: PMC4156935

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] OutSplice_1.7.0  BiocStyle_2.35.0
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0                         
##  [2] SummarizedExperiment_1.37.0             
##  [3] rjson_0.2.23                            
##  [4] xfun_0.49                               
##  [5] bslib_0.8.0                             
##  [6] lattice_0.22-6                          
##  [7] Biobase_2.67.0                          
##  [8] vctrs_0.6.5                             
##  [9] tools_4.4.2                             
## [10] bitops_1.0-9                            
## [11] generics_0.1.3                          
## [12] stats4_4.4.2                            
## [13] curl_6.0.1                              
## [14] parallel_4.4.2                          
## [15] AnnotationDbi_1.69.0                    
## [16] RSQLite_2.3.8                           
## [17] blob_1.2.4                              
## [18] pkgconfig_2.0.3                         
## [19] Matrix_1.7-1                            
## [20] S4Vectors_0.45.2                        
## [21] lifecycle_1.0.4                         
## [22] GenomeInfoDbData_1.2.13                 
## [23] compiler_4.4.2                          
## [24] Rsamtools_2.23.1                        
## [25] Biostrings_2.75.1                       
## [26] codetools_0.2-20                        
## [27] GenomeInfoDb_1.43.2                     
## [28] htmltools_0.5.8.1                       
## [29] sys_3.4.3                               
## [30] buildtools_1.0.0                        
## [31] sass_0.4.9                              
## [32] RCurl_1.98-1.16                         
## [33] yaml_2.3.10                             
## [34] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 
## [35] crayon_1.5.3                            
## [36] jquerylib_0.1.4                         
## [37] BiocParallel_1.41.0                     
## [38] DelayedArray_0.33.2                     
## [39] cachem_1.1.0                            
## [40] org.Hs.eg.db_3.20.0                     
## [41] abind_1.4-8                             
## [42] digest_0.6.37                           
## [43] restfulr_0.0.15                         
## [44] maketools_1.3.1                         
## [45] grid_4.4.2                              
## [46] fastmap_1.2.0                           
## [47] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0
## [48] SparseArray_1.7.2                       
## [49] cli_3.6.3                               
## [50] S4Arrays_1.7.1                          
## [51] GenomicFeatures_1.59.1                  
## [52] XML_3.99-0.17                           
## [53] UCSC.utils_1.3.0                        
## [54] bit64_4.5.2                             
## [55] rmarkdown_2.29                          
## [56] XVector_0.47.0                          
## [57] httr_1.4.7                              
## [58] matrixStats_1.4.1                       
## [59] bit_4.5.0                               
## [60] png_0.1-8                               
## [61] memoise_2.0.1                           
## [62] evaluate_1.0.1                          
## [63] knitr_1.49                              
## [64] GenomicRanges_1.59.1                    
## [65] IRanges_2.41.1                          
## [66] BiocIO_1.17.1                           
## [67] rtracklayer_1.67.0                      
## [68] rlang_1.1.4                             
## [69] DBI_1.2.3                               
## [70] BiocManager_1.30.25                     
## [71] BiocGenerics_0.53.3                     
## [72] jsonlite_1.8.9                          
## [73] R6_2.5.1                                
## [74] MatrixGenerics_1.19.0                   
## [75] GenomicAlignments_1.43.0                
## [76] zlibbioc_1.52.0