SPLINTER provides tools to analyze alternative splicing sites, interpret outcomes based on sequence information, select and design primers for site validiation and give visual representation of the event to guide downstream experiments.
In this example, we will be utilizing the mm9 genome for mouse. You will need to install the appropriate package (eg. BSgenome.Mmusculus.UCSC.mm9) for the genome that you will be using.
library(BSgenome.Mmusculus.UCSC.mm9)
library(GenomicFeatures)
bsgenome <- BSgenome.Mmusculus.UCSC.mm9
We begin with full set of available transcripts to screen from, and read it into a object. One source of this (best option to ensure compatibility) would be the GTF file that you have used for alternative splicing analysis. For other sources of data, please refer to GenomicFeatures).
We then extract the coding sequences (CDS), and transcripts in general (coding and non-coding) (exons) from this object.
data_path<-system.file("extdata",package="SPLINTER")
gtf_file<-paste(data_path,"/Mus_musculus.Ensembl.NCBIM37.65.partial.gtf",sep="")
library(txdbmaker)
txdb <- makeTxDbFromGFF(file=gtf_file,chrominfo = Seqinfo(genome="mm9"))
## Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version information
## is not available for this TxDb object
# txdb generation can take quite long, you can save the object and load it the next time
# saveDb(txdb,file="txdb_hg19.sqlite")
# txdb<-loadDb(file="txdb_hg19.sqlite")
# extract CDS and exon information from TxDb object
thecds<-cdsBy(txdb,by="tx",use.names=TRUE)
theexons<-exonsBy(txdb,by="tx",use.names=TRUE)
The output file from MATS is used here, but essentially all that is needed are coordinates of the exons (target and flanking) involved in the splicing processt to be studied. For the case of exon skipping, this will include the upstream, target and downstream exons. More output types will be supported in the future.
The following types of alternative splicing events are accepted:
Type of alternative splicing event | Definition |
---|---|
SE | Skipped exon |
RI | Retained intron |
MXE | Mutually exclusive exon |
A5SS | Alternative 5’ splice site |
A3SS | Alternative 3’ splice site |
typeofAS="SE"
mats_file<-paste(data_path,"/skipped_exons.txt",sep="")
splice_data <-extractSpliceEvents(data=mats_file, filetype='mats', splicetype=typeofAS)
splice_data$data[,c(1:10)]
## ID Symbol chr strand exonStart exonEnd
## 4825 ENSMUSG00000052337 ENSMUSG00000052337 chr6 + 71816720 71816734
## 17227 ENSMUSG00000023110 ENSMUSG00000023110 chr14 - 55132128 55132291
## 22583 ENSMUSG00000079477 ENSMUSG00000079477 chr6 - 87965626 87965712
## 13693 ENSMUSG00000024911 ENSMUSG00000024911 chr19 + 5464334 5464431
## 18826 ENSMUSG00000027940 ENSMUSG00000027940 chr3 + 89894935 89895013
## upstreamStart upstreamEnd downstreamStart downstreamEnd
## 4825 71813135 71813264 71818574 71818797
## 17227 55130830 55130991 55133434 55133483
## 22583 87963632 87963692 87995067 87995239
## 13693 5464129 5464215 5464925 5465051
## 18826 89893932 89894001 89903450 89904487
SPLINTER assumes that the main identifier is ENSEMBL, however gene symbols can be added.
splice_data<-addEnsemblAnnotation(data=splice_data,species="mmusculus")
# (Optional) Sorting the dataframe, if you have supporting statistical information
splice_data$data<-splice_data$data[with(splice_data$data,order(FDR,-IncLevelDifference)),]
head(splice_data$data[,c(1:10)])
## ID Symbol chr strand exonStart exonEnd upstreamStart
## 1 ENSMUSG00000052337 Immt chr6 + 71816720 71816734 71813135
## 2 ENSMUSG00000023110 Prmt5 chr14 - 55132128 55132291 55130830
## 3 ENSMUSG00000079477 Rab7 chr6 - 87965626 87965712 87963632
## 4 ENSMUSG00000024911 Fibp chr19 + 5464334 5464431 5464129
## 5 ENSMUSG00000027940 Tpm3 chr3 + 89894935 89895013 89893932
## upstreamEnd downstreamStart downstreamEnd
## 1 71813264 71818574 71818797
## 2 55130991 55133434 55133483
## 3 87963692 87995067 87995239
## 4 5464215 5464925 5465051
## 5 89894001 89903450 89904487
Once we have defined the events, we will pick 1 event to analyze.
single_id='Prmt5'
pp<-which(grepl(single_id,splice_data$data$Symbol)) # Prmt5 has 1 record
splice_data$data[pp,c(1:6)] # show all records
## ID Symbol chr strand exonStart exonEnd
## 2 ENSMUSG00000023110 Prmt5 chr14 - 55132128 55132291
To reduce search complexity, we define the valid transcripts and coding sequences with regards to our gene of interest. We find that Prmt5 has 7 transcripts, 2 of which are coding sequences.
The makeROI
function will create a list containing
GRanges
objects for the splicing event. This will help
identify and construct relevant outputs later.
This list contains the following information:
Type of alternative splicing | Type 1 representation | Type 2 representation (annotated only) |
---|---|---|
SE | isoform with event exon included | isoform with the exon skipped |
RI | isoform with normal exon boundaries | isoform with the intron retained |
MXE | isoform defined 1st (leftmost) in input | isoform defined 2nd in input |
A5SS | isoform with longer exon | isoform with shorter exon |
A3SS | isoform with longer exon | isoform with shorter exon |
## $type
## [1] "SE"
##
## $name
## [1] "ENSMUSG00000023110"
##
## $roi
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | exon_rank
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr14 55132128-55132291 - | 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $flank
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | exon_rank
## <Rle> <IRanges> <Rle> | <integer>
## [1] chr14 55133434-55133483 - | 2
## [2] chr14 55130830-55130991 - | 1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $roi_range
## GRangesList object of length 2:
## [[1]]
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr14 55133434-55133483 -
## [2] chr14 55132128-55132291 -
## [3] chr14 55130830-55130991 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## [[2]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr14 55133434-55133483 -
## [2] chr14 55130830-55130991 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
At this juncture, we look for transcripts are compatible with the ROI. Compatibility is defined as having the exact cassette (matching upstream, target, downstream) exons. In the case of intron retention, this would just be the 2 exons flanking the intron.
We notice here that Prmt5 only has 1 compatible transcript involved in the event ROI, out of 7 transcripts (or 2 coding transcripts). There are no Type 2 transcripts, which means there are no annotated transcripts of Prmt5 containing the alternative event.
## Checking Type 1.....
## ENSMUST00000023873
## Checking Type 2.....
## No transcripts found!
## 1 match(es) from original 7 transcripts.
## Checking Type 1.....
## ENSMUST00000023873
## Checking Type 2.....
## No transcripts found!
## 1 match(es) from original 2 transcripts.
event<-eventOutcomeCompare(seq1=compatible_cds$hits[[1]],seq2=region_minus_exon,
genome=bsgenome,direction=TRUE,fullseq=FALSE)
##
## ### ENSMUST00000023873 ###
## Nonsense mediated decay.
## - early termination
## middle insertion GILKPK (247-252)
## middle deletion LEIGADLP (206-213)
## middle deletion EPIK (224-227)
## middle deletion KAAIL (227-231)
## multiple mismatch sites
## 3' end mismatch : ALEIGADLPSNHVIDRWLGEPIKAAILPTSIFLTNKKGFPVLSKVQQRLIFRLLKLEVQFIITGTNHHSEKEFCSYLQYLEYLSQNRPPPNAYELFAKGYEDYLQSPLQPLMDNLESQTYEVFEKDPIKYSQYQQAIYKCLLDRVPEEEKETNVQVLMVLGAGRGPLVNASLRAAKQAERRIRLYAVEKNPNAVVTLENWQFEEWGSQVTVVSSDMREWVAPEKADIIVSELLGSFADNELSPECLDGAQHFLKDDGVSIPGEYTSFLAPISSSKLYNEVRACREKDRDPEAQFEMPYVVRLHNFHQLSAPKPCFTFSHPNRDPMIDNNRYCTLEFPVEVNTVLHGFAGYFETVLYRDITLSIRPETHSPGMFSWFPIFFPIKQPITVHEGQNICVRFWRCSNSKKVWYEWAVTAPVCSSIHNPTGRSYTIGL* to VGSAVYHHGNQPPLREGVLFLPPVLGILKPKSPSTQCL*
## length : 637 AA to 242 AA
## $alignment
## Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: MAAMAVGGAGGSRVSSGRDLNCVPEIADTLGA...GVL----FLP-----PVLGILKPKSPSTQCL*
## subject: [1] MAAMAVGGAGGSRVSSGRDLNCVPEIADTLGA...AILPTSIFLTNKKGFPVL------SKVQQRLI
## score: 1085.5
##
## $eventtypes
## [1] "(NMD)"
This function will return the DNA of the ROI, with exons separated by
“[]” (Primer3 notation) and the junction marked by
jstart
.
## $seq
## [1] "GTGGCATAACTTTCGGACTCTGTGTGACTATAGCAAGAGAATTGCAGTAG[]TTGGAAGTGCAGTTTATCATCACGGGAACCAACCACCACTCAGAGAAGGAGTTCTGTTCCTACCTCCAGTACTTGGAATACTTAAGCCAAAATCGCCCTCCACCCAATGCCTATGAGCTCTTTGCCAAAGGCTATGAAGACTATCTGCAGTCCCCACTCCAG"
##
## $jstart
## [1] 51
We have included a helper function to run Primer3 from within
R
. You will need to define the path to your Primer3
installation. Refer to ?callPrimer3
for more details.
## i PRIMER_LEFT_SEQUENCE PRIMER_RIGHT_SEQUENCE PRIMER_LEFT_TM
## 1 0 ACTTTCGGACTCTGTGTGACT TCATAGGCATTGGGTGGAGG 58.967
## 2 1 TGGCATAACTTTCGGACTCTG GGAGTGGGGACTGCAGATAG 58.027
## 3 2 GCATAACTTTCGGACTCTGTGT CTCCTTCTCTGAGTGGTGGT 58.673
## 4 3 TCGGACTCTGTGTGACTATAGC ATTGGGTGGAGGGCGATTTT 59.056
## 5 4 GGACTCTGTGTGACTATAGCAAG GCTCATAGGCATTGGGTGG 58.322
Alternatively, primers can be entered manually with the appropriate headers.
As a confirmation, we can run the primers against the ROI to give the genomic location of the primer coverage.
## $total_span
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr14 55130876-55133475 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $primer_left_span
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr14 55133455-55133475 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
##
## $primer_right_span
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr14 55130876-55130895 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
getPCRsizes
will give you the length of the PCR product
produced by the set of primers.
## ID bp
## 1 ENSMUST00000023873 322
tx_minus_exon <-removeRegion(compatible_tx$hits[[1]],roi)
pcr_result2<-getPCRsizes(cp,tx_minus_exon)
pcr_result2
## ID bp
## 1 ENSMUST00000023873 158
While getPCRsizes
will return all possible PCR products
for a given set of annotation, splitPCRhit
will return PCR
product sizes that are relevant to the splicing event in question.
## ID bp type
## 1 ENSMUST00000023873 322 1
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] txdbmaker_1.3.1 GenomicFeatures_1.59.1
## [3] AnnotationDbi_1.69.0 Biobase_2.67.0
## [5] BSgenome.Mmusculus.UCSC.mm9_1.4.0 BSgenome_1.75.0
## [7] rtracklayer_1.67.0 BiocIO_1.17.1
## [9] Biostrings_2.75.3 XVector_0.47.0
## [11] GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
## [13] IRanges_2.41.2 S4Vectors_0.45.2
## [15] BiocGenerics_0.53.3 generics_0.1.3
## [17] SPLINTER_1.33.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] rstudioapi_0.17.1 jsonlite_1.8.9
## [5] magrittr_2.0.3 farver_2.1.2
## [7] rmarkdown_2.29 zlibbioc_1.52.0
## [9] vctrs_0.6.5 memoise_2.0.1
## [11] Rsamtools_2.23.1 RCurl_1.98-1.16
## [13] base64enc_0.1-3 htmltools_0.5.8.1
## [15] S4Arrays_1.7.1 progress_1.2.3
## [17] curl_6.0.1 SparseArray_1.7.2
## [19] Formula_1.2-5 sass_0.4.9
## [21] bslib_0.8.0 htmlwidgets_1.6.4
## [23] plyr_1.8.9 Gviz_1.51.0
## [25] httr2_1.0.7 cachem_1.1.0
## [27] buildtools_1.0.0 GenomicAlignments_1.43.0
## [29] lifecycle_1.0.4 pkgconfig_2.0.3
## [31] Matrix_1.7-1 R6_2.5.1
## [33] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [35] MatrixGenerics_1.19.0 digest_0.6.37
## [37] colorspace_2.1-1 Hmisc_5.2-1
## [39] RSQLite_2.3.9 labeling_0.4.3
## [41] seqLogo_1.73.0 filelock_1.0.3
## [43] httr_1.4.7 abind_1.4-8
## [45] compiler_4.4.2 withr_3.0.2
## [47] bit64_4.5.2 htmlTable_2.4.3
## [49] backports_1.5.0 BiocParallel_1.41.0
## [51] DBI_1.2.3 biomaRt_2.63.0
## [53] rappdirs_0.3.3 DelayedArray_0.33.3
## [55] rjson_0.2.23 tools_4.4.2
## [57] foreign_0.8-87 nnet_7.3-19
## [59] glue_1.8.0 restfulr_0.0.15
## [61] grid_4.4.2 checkmate_2.3.2
## [63] cluster_2.1.8 gtable_0.3.6
## [65] ensembldb_2.31.0 data.table_1.16.4
## [67] hms_1.1.3 xml2_1.3.6
## [69] pillar_1.10.0 stringr_1.5.1
## [71] dplyr_1.1.4 BiocFileCache_2.15.0
## [73] lattice_0.22-6 bit_4.5.0.1
## [75] deldir_2.0-4 biovizBase_1.55.0
## [77] tidyselect_1.2.1 googleVis_0.7.3
## [79] maketools_1.3.1 knitr_1.49
## [81] gridExtra_2.3 ProtGenerics_1.39.0
## [83] SummarizedExperiment_1.37.0 xfun_0.49
## [85] matrixStats_1.4.1 stringi_1.8.4
## [87] UCSC.utils_1.3.0 lazyeval_0.2.2
## [89] yaml_2.3.10 evaluate_1.0.1
## [91] codetools_0.2-20 interp_1.1-6
## [93] tibble_3.2.1 BiocManager_1.30.25
## [95] cli_3.6.3 rpart_4.1.23
## [97] munsell_0.5.1 jquerylib_0.1.4
## [99] dichromat_2.0-0.1 Rcpp_1.0.13-1
## [101] dbplyr_2.5.0 png_0.1-8
## [103] XML_3.99-0.17 parallel_4.4.2
## [105] ggplot2_3.5.1 blob_1.2.4
## [107] prettyunits_1.2.0 latticeExtra_0.6-30
## [109] jpeg_0.1-10 AnnotationFilter_1.31.0
## [111] bitops_1.0-9 pwalign_1.3.1
## [113] VariantAnnotation_1.53.0 scales_1.3.0
## [115] purrr_1.0.2 crayon_1.5.3
## [117] rlang_1.1.4 KEGGREST_1.47.0