SPLINTER

Introduction

SPLINTER provides tools to analyze alternative splicing sites, interpret outcomes based on sequence information, select and design primers for site validiation and give visual representation of the event to guide downstream experiments.

Loading the package

To load the SPLINTER package:

library(SPLINTER)

Initializing the genome for transcript selection

In this example, we will be utilizing the mm9 genome for mouse. You will need to install the appropriate package (eg. BSgenome.Mmusculus.UCSC.mm9) for the genome that you will be using.

library(BSgenome.Mmusculus.UCSC.mm9)
library(GenomicFeatures)
bsgenome <- BSgenome.Mmusculus.UCSC.mm9

We begin with full set of available transcripts to screen from, and read it into a object. One source of this (best option to ensure compatibility) would be the GTF file that you have used for alternative splicing analysis. For other sources of data, please refer to GenomicFeatures).

We then extract the coding sequences (CDS), and transcripts in general (coding and non-coding) (exons) from this object.

data_path<-system.file("extdata",package="SPLINTER")
gtf_file<-paste(data_path,"/Mus_musculus.Ensembl.NCBIM37.65.partial.gtf",sep="")

library(txdbmaker)
txdb <- makeTxDbFromGFF(file=gtf_file,chrominfo = Seqinfo(genome="mm9"))
## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
## is not available for this TxDb object
# txdb generation can take quite long, you can save the object and load it the next time
# saveDb(txdb,file="txdb_hg19.sqlite")
# txdb<-loadDb(file="txdb_hg19.sqlite")

# extract CDS and exon information from TxDb object
thecds<-cdsBy(txdb,by="tx",use.names=TRUE)
theexons<-exonsBy(txdb,by="tx",use.names=TRUE)

Reading in the splicing analysis file

The output file from MATS is used here, but essentially all that is needed are coordinates of the exons (target and flanking) involved in the splicing processt to be studied. For the case of exon skipping, this will include the upstream, target and downstream exons. More output types will be supported in the future.

The following types of alternative splicing events are accepted:

Type of alternative splicing event Definition
SE Skipped exon
RI Retained intron
MXE Mutually exclusive exon
A5SS Alternative 5’ splice site
A3SS Alternative 3’ splice site
typeofAS="SE"
mats_file<-paste(data_path,"/skipped_exons.txt",sep="")
splice_data <-extractSpliceEvents(data=mats_file, filetype='mats', splicetype=typeofAS)
splice_data$data[,c(1:10)]
##                       ID             Symbol   chr strand exonStart  exonEnd
## 4825  ENSMUSG00000052337 ENSMUSG00000052337  chr6      +  71816720 71816734
## 17227 ENSMUSG00000023110 ENSMUSG00000023110 chr14      -  55132128 55132291
## 22583 ENSMUSG00000079477 ENSMUSG00000079477  chr6      -  87965626 87965712
## 13693 ENSMUSG00000024911 ENSMUSG00000024911 chr19      +   5464334  5464431
## 18826 ENSMUSG00000027940 ENSMUSG00000027940  chr3      +  89894935 89895013
##       upstreamStart upstreamEnd downstreamStart downstreamEnd
## 4825       71813135    71813264        71818574      71818797
## 17227      55130830    55130991        55133434      55133483
## 22583      87963632    87963692        87995067      87995239
## 13693       5464129     5464215         5464925       5465051
## 18826      89893932    89894001        89903450      89904487

Additional annotation

SPLINTER assumes that the main identifier is ENSEMBL, however gene symbols can be added.

splice_data<-addEnsemblAnnotation(data=splice_data,species="mmusculus")

# (Optional) Sorting the dataframe, if you have supporting statistical information
splice_data$data<-splice_data$data[with(splice_data$data,order(FDR,-IncLevelDifference)),]
head(splice_data$data[,c(1:10)])
##                   ID Symbol   chr strand exonStart  exonEnd upstreamStart
## 1 ENSMUSG00000052337   Immt  chr6      +  71816720 71816734      71813135
## 2 ENSMUSG00000023110  Prmt5 chr14      -  55132128 55132291      55130830
## 3 ENSMUSG00000079477   Rab7  chr6      -  87965626 87965712      87963632
## 4 ENSMUSG00000024911   Fibp chr19      +   5464334  5464431       5464129
## 5 ENSMUSG00000027940   Tpm3  chr3      +  89894935 89895013      89893932
##   upstreamEnd downstreamStart downstreamEnd
## 1    71813264        71818574      71818797
## 2    55130991        55133434      55133483
## 3    87963692        87995067      87995239
## 4     5464215         5464925       5465051
## 5    89894001        89903450      89904487

Analyzing a specific gene

Inspecting a single gene in more detail (single record)

Once we have defined the events, we will pick 1 event to analyze.

single_id='Prmt5'
pp<-which(grepl(single_id,splice_data$data$Symbol)) # Prmt5 has 1 record

splice_data$data[pp,c(1:6)] # show all records
##                   ID Symbol   chr strand exonStart  exonEnd
## 2 ENSMUSG00000023110  Prmt5 chr14      -  55132128 55132291
single_record<-splice_data$data[pp[1],]

Finding relevant transcripts from the ENSEMBL database

To reduce search complexity, we define the valid transcripts and coding sequences with regards to our gene of interest. We find that Prmt5 has 7 transcripts, 2 of which are coding sequences.

valid_tx <- findTX(id=single_record$ID,tx=theexons,db=txdb)

valid_cds<- findTX(id=single_record$ID,tx=thecds,db=txdb)

Constructing the region of interest (ROI)

The makeROI function will create a list containing GRanges objects for the splicing event. This will help identify and construct relevant outputs later.

This list contains the following information:

  • type: type of alternative splicing event
  • name: name of gene
  • roi: GRanges object of the exon
  • flank: GRanges object of the flanking exons
  • roi_range: GRanges list containing
    • GRanges object of Type 1
    • GRanges object of Type 2
Type of alternative splicing Type 1 representation Type 2 representation (annotated only)
SE isoform with event exon included isoform with the exon skipped
RI isoform with normal exon boundaries isoform with the intron retained
MXE isoform defined 1st (leftmost) in input isoform defined 2nd in input
A5SS isoform with longer exon isoform with shorter exon
A3SS isoform with longer exon isoform with shorter exon
roi <- makeROI(single_record,type="SE")
roi
## $type
## [1] "SE"
## 
## $name
## [1] "ENSMUSG00000023110"
## 
## $roi
## GRanges object with 1 range and 1 metadata column:
##       seqnames            ranges strand | exon_rank
##          <Rle>         <IRanges>  <Rle> | <integer>
##   [1]    chr14 55132128-55132291      - |         1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $flank
## GRanges object with 2 ranges and 1 metadata column:
##       seqnames            ranges strand | exon_rank
##          <Rle>         <IRanges>  <Rle> | <integer>
##   [1]    chr14 55133434-55133483      - |         2
##   [2]    chr14 55130830-55130991      - |         1
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $roi_range
## GRangesList object of length 2:
## [[1]]
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr14 55133434-55133483      -
##   [2]    chr14 55132128-55132291      -
##   [3]    chr14 55130830-55130991      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## [[2]]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr14 55133434-55133483      -
##   [2]    chr14 55130830-55130991      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Finding transcripts that contain the ROI

At this juncture, we look for transcripts are compatible with the ROI. Compatibility is defined as having the exact cassette (matching upstream, target, downstream) exons. In the case of intron retention, this would just be the 2 exons flanking the intron.

We notice here that Prmt5 only has 1 compatible transcript involved in the event ROI, out of 7 transcripts (or 2 coding transcripts). There are no Type 2 transcripts, which means there are no annotated transcripts of Prmt5 containing the alternative event.

compatible_tx<- findCompatibleEvents(valid_tx,roi=roi,verbose=TRUE)
## Checking Type 1.....
## ENSMUST00000023873
## Checking Type 2.....
## No transcripts found!
## 1 match(es) from original 7 transcripts.
compatible_cds<- findCompatibleEvents(valid_cds,valid_tx,roi=roi,verbose=TRUE)
## Checking Type 1.....
## ENSMUST00000023873
## Checking Type 2.....
## No transcripts found!
## 1 match(es) from original 2 transcripts.

Simulating alternatively spliced products

Simulating the outcome of exon skipping by removing an exonic region

region_minus_exon <-removeRegion(compatible_cds$hits[[1]],roi)

Simulating the outcome of intron retention by inserting an intronic region

# Not relevant for this Prmt5 skipped exon example
region_plus_exon <-insertRegion(region_minus_exon,roi)

Comparing sequences before and after removal/insertion of a region

event<-eventOutcomeCompare(seq1=compatible_cds$hits[[1]],seq2=region_minus_exon,
                    genome=bsgenome,direction=TRUE,fullseq=FALSE)
## 
## ### ENSMUST00000023873 ###
## Nonsense mediated decay.
## -   early termination
## middle insertion GILKPK (247-252)
## middle deletion LEIGADLP (206-213)
## middle deletion EPIK (224-227)
## middle deletion KAAIL (227-231)
## multiple mismatch sites
## 3' end mismatch : ALEIGADLPSNHVIDRWLGEPIKAAILPTSIFLTNKKGFPVLSKVQQRLIFRLLKLEVQFIITGTNHHSEKEFCSYLQYLEYLSQNRPPPNAYELFAKGYEDYLQSPLQPLMDNLESQTYEVFEKDPIKYSQYQQAIYKCLLDRVPEEEKETNVQVLMVLGAGRGPLVNASLRAAKQAERRIRLYAVEKNPNAVVTLENWQFEEWGSQVTVVSSDMREWVAPEKADIIVSELLGSFADNELSPECLDGAQHFLKDDGVSIPGEYTSFLAPISSSKLYNEVRACREKDRDPEAQFEMPYVVRLHNFHQLSAPKPCFTFSHPNRDPMIDNNRYCTLEFPVEVNTVLHGFAGYFETVLYRDITLSIRPETHSPGMFSWFPIFFPIKQPITVHEGQNICVRFWRCSNSKKVWYEWAVTAPVCSSIHNPTGRSYTIGL* to VGSAVYHHGNQPPLREGVLFLPPVLGILKPKSPSTQCL*
## length : 637 AA to 242 AA
event
## $alignment
## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern:     MAAMAVGGAGGSRVSSGRDLNCVPEIADTLGA...GVL----FLP-----PVLGILKPKSPSTQCL*
## subject: [1] MAAMAVGGAGGSRVSSGRDLNCVPEIADTLGA...AILPTSIFLTNKKGFPVL------SKVQQRLI
## score: 1085.5 
## 
## $eventtypes
## [1] "(NMD)"

Designing primers to inspect splicing regions

Getting the DNA of the region of interest

This function will return the DNA of the ROI, with exons separated by “[]” (Primer3 notation) and the junction marked by jstart.

aa<-getRegionDNA(roi,bsgenome) 
aa
## $seq
## [1] "GTGGCATAACTTTCGGACTCTGTGTGACTATAGCAAGAGAATTGCAGTAG[]TTGGAAGTGCAGTTTATCATCACGGGAACCAACCACCACTCAGAGAAGGAGTTCTGTTCCTACCTCCAGTACTTGGAATACTTAAGCCAAAATCGCCCTCCACCCAATGCCTATGAGCTCTTTGCCAAAGGCTATGAAGACTATCTGCAGTCCCCACTCCAG"
## 
## $jstart
## [1] 51

Using Primer3 to design primers for alternative splicing identification

We have included a helper function to run Primer3 from within R. You will need to define the path to your Primer3 installation. Refer to ?callPrimer3 for more details.

primers<-callPrimer3(seq=aa$seq,sequence_target = aa$jstart,size_range='100-500')
primers[,c(1:4)]
##   i    PRIMER_LEFT_SEQUENCE PRIMER_RIGHT_SEQUENCE PRIMER_LEFT_TM
## 1 0   ACTTTCGGACTCTGTGTGACT  TCATAGGCATTGGGTGGAGG         58.967
## 2 1   TGGCATAACTTTCGGACTCTG  GGAGTGGGGACTGCAGATAG         58.027
## 3 2  GCATAACTTTCGGACTCTGTGT  CTCCTTCTCTGAGTGGTGGT         58.673
## 4 3  TCGGACTCTGTGTGACTATAGC  ATTGGGTGGAGGGCGATTTT         59.056
## 5 4 GGACTCTGTGTGACTATAGCAAG   GCTCATAGGCATTGGGTGG         58.322

Alternatively, primers can be entered manually with the appropriate headers.

primers <- data.frame(PRIMER_LEFT_SEQUENCE="ACTTTCGGACTCTGTGTGACT",
                      PRIMER_RIGHT_SEQUENCE="TCATAGGCATTGGGTGGAGG",
                      stringsAsFactors=FALSE)

Checking primers coverage

As a confirmation, we can run the primers against the ROI to give the genomic location of the primer coverage.

cp<-checkPrimer(primers[1,],bsgenome,roi)

cp
## $total_span
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr14 55130876-55133475      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $primer_left_span
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr14 55133455-55133475      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
## 
## $primer_right_span
## GRanges object with 1 range and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr14 55130876-55130895      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Predicting PCR results using the primers

getPCRsizes will give you the length of the PCR product produced by the set of primers.

pcr_result1<-getPCRsizes(cp,theexons)
pcr_result1
##                   ID  bp
## 1 ENSMUST00000023873 322
tx_minus_exon <-removeRegion(compatible_tx$hits[[1]],roi)
pcr_result2<-getPCRsizes(cp,tx_minus_exon)
pcr_result2
##                   ID  bp
## 1 ENSMUST00000023873 158

Selecting sizes relevant to splicing event (subset of getPCRsizes)

While getPCRsizes will return all possible PCR products for a given set of annotation, splitPCRhit will return PCR product sizes that are relevant to the splicing event in question.

relevant_pcr_bands<-splitPCRhit(pcr_result1,compatible_tx)

relevant_pcr_bands
##                   ID  bp type
## 1 ENSMUST00000023873 322    1

Plotting results

# reading in BAM files
mt<-paste(data_path,"/mt_chr14.bam",sep="")
wt<-paste(data_path,"/wt_chr14.bam",sep="")

# Plotting genomic range, read density and splice changes
eventPlot(transcripts=theexons,roi_plot=roi,bams=c(wt,mt),names=c('wt','mt'),
          annoLabel=single_id,rspan=2000)

# Barplot of PSI values if provided
psiPlot(single_record)

Session info

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] txdbmaker_1.3.1                   GenomicFeatures_1.59.1           
##  [3] AnnotationDbi_1.69.0              Biobase_2.67.0                   
##  [5] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.75.0                  
##  [7] rtracklayer_1.67.0                BiocIO_1.17.1                    
##  [9] Biostrings_2.75.3                 XVector_0.47.0                   
## [11] GenomicRanges_1.59.1              GenomeInfoDb_1.43.2              
## [13] IRanges_2.41.2                    S4Vectors_0.45.2                 
## [15] BiocGenerics_0.53.3               generics_0.1.3                   
## [17] SPLINTER_1.33.0                   BiocStyle_2.35.0                 
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          sys_3.4.3                  
##   [3] rstudioapi_0.17.1           jsonlite_1.8.9             
##   [5] magrittr_2.0.3              farver_2.1.2               
##   [7] rmarkdown_2.29              zlibbioc_1.52.0            
##   [9] vctrs_0.6.5                 memoise_2.0.1              
##  [11] Rsamtools_2.23.1            RCurl_1.98-1.16            
##  [13] base64enc_0.1-3             htmltools_0.5.8.1          
##  [15] S4Arrays_1.7.1              progress_1.2.3             
##  [17] curl_6.0.1                  SparseArray_1.7.2          
##  [19] Formula_1.2-5               sass_0.4.9                 
##  [21] bslib_0.8.0                 htmlwidgets_1.6.4          
##  [23] plyr_1.8.9                  Gviz_1.51.0                
##  [25] httr2_1.0.7                 cachem_1.1.0               
##  [27] buildtools_1.0.0            GenomicAlignments_1.43.0   
##  [29] lifecycle_1.0.4             pkgconfig_2.0.3            
##  [31] Matrix_1.7-1                R6_2.5.1                   
##  [33] fastmap_1.2.0               GenomeInfoDbData_1.2.13    
##  [35] MatrixGenerics_1.19.0       digest_0.6.37              
##  [37] colorspace_2.1-1            Hmisc_5.2-1                
##  [39] RSQLite_2.3.9               labeling_0.4.3             
##  [41] seqLogo_1.73.0              filelock_1.0.3             
##  [43] httr_1.4.7                  abind_1.4-8                
##  [45] compiler_4.4.2              withr_3.0.2                
##  [47] bit64_4.5.2                 htmlTable_2.4.3            
##  [49] backports_1.5.0             BiocParallel_1.41.0        
##  [51] DBI_1.2.3                   biomaRt_2.63.0             
##  [53] rappdirs_0.3.3              DelayedArray_0.33.3        
##  [55] rjson_0.2.23                tools_4.4.2                
##  [57] foreign_0.8-87              nnet_7.3-19                
##  [59] glue_1.8.0                  restfulr_0.0.15            
##  [61] grid_4.4.2                  checkmate_2.3.2            
##  [63] cluster_2.1.8               gtable_0.3.6               
##  [65] ensembldb_2.31.0            data.table_1.16.4          
##  [67] hms_1.1.3                   xml2_1.3.6                 
##  [69] pillar_1.10.0               stringr_1.5.1              
##  [71] dplyr_1.1.4                 BiocFileCache_2.15.0       
##  [73] lattice_0.22-6              bit_4.5.0.1                
##  [75] deldir_2.0-4                biovizBase_1.55.0          
##  [77] tidyselect_1.2.1            googleVis_0.7.3            
##  [79] maketools_1.3.1             knitr_1.49                 
##  [81] gridExtra_2.3               ProtGenerics_1.39.0        
##  [83] SummarizedExperiment_1.37.0 xfun_0.49                  
##  [85] matrixStats_1.4.1           stringi_1.8.4              
##  [87] UCSC.utils_1.3.0            lazyeval_0.2.2             
##  [89] yaml_2.3.10                 evaluate_1.0.1             
##  [91] codetools_0.2-20            interp_1.1-6               
##  [93] tibble_3.2.1                BiocManager_1.30.25        
##  [95] cli_3.6.3                   rpart_4.1.23               
##  [97] munsell_0.5.1               jquerylib_0.1.4            
##  [99] dichromat_2.0-0.1           Rcpp_1.0.13-1              
## [101] dbplyr_2.5.0                png_0.1-8                  
## [103] XML_3.99-0.17               parallel_4.4.2             
## [105] ggplot2_3.5.1               blob_1.2.4                 
## [107] prettyunits_1.2.0           latticeExtra_0.6-30        
## [109] jpeg_0.1-10                 AnnotationFilter_1.31.0    
## [111] bitops_1.0-9                pwalign_1.3.1              
## [113] VariantAnnotation_1.53.0    scales_1.3.0               
## [115] purrr_1.0.2                 crayon_1.5.3               
## [117] rlang_1.1.4                 KEGGREST_1.47.0