Introduction to wiggleplotr

wiggleplotr is a tool to visualise RNA-seq read overage accross annotated exons. A key feature of wiggleplotr is that it is able rescale all introns of a gene to fixed length, making it easier to see differences in read coverage between neighbouring exons that can otherwise be too far away. Since wiggleplotr takes standard BigWig files as input, it can also be used to visualise read overage from other sequencing-based assays such as ATAC-seq and ChIP-seq.

##Getting started To install wiggleplotr, start R and enter:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("wiggleplotr")

To run the code examples shown in this vignette, we need to load the following packages:

library("wiggleplotr")
library("dplyr")
library("GenomicRanges")
library("GenomicFeatures")
library("biomaRt")

##Visualizing transcript annotations First, the plotTranscripts function allows you to visualise the structucte of all transcripts of a gene (or multiple genes). It takes the following three inputs, but only the first one is required:

  • exons - list of GRanges objects containing the start and end coordinates of exons for each transcript.
  • cdss - list of GRanges objects containing the start and end coordinates of coding sequence (cds) for each transcript (optional).
  • annotations - a data frame with at least the following three columns: transcript_id, gene_name and strand (optional).

To get you started, the wiggleplotr package comes with example annotations for the nine protein coding transcripts of the NCOA7 gene pre-loaded. Please see below to learn how to download those annotations from Ensembl on your own or how to extract them automatically from the EnsDb and TxDb objects. This is what the annotations look like:

ncoa7_metadata
## # A tibble: 9 × 4
##   transcript_id   gene_id         gene_name strand
##   <chr>           <chr>           <chr>      <int>
## 1 ENST00000368357 ENSG00000111912 NCOA7          1
## 2 ENST00000453302 ENSG00000111912 NCOA7          1
## 3 ENST00000417494 ENSG00000111912 NCOA7          1
## 4 ENST00000229634 ENSG00000111912 NCOA7          1
## 5 ENST00000428318 ENSG00000111912 NCOA7          1
## 6 ENST00000419660 ENSG00000111912 NCOA7          1
## 7 ENST00000438495 ENSG00000111912 NCOA7          1
## 8 ENST00000444128 ENSG00000111912 NCOA7          1
## 9 ENST00000392477 ENSG00000111912 NCOA7          1
names(ncoa7_exons)
## [1] "ENST00000368357" "ENST00000453302" "ENST00000417494" "ENST00000229634"
## [5] "ENST00000428318" "ENST00000419660" "ENST00000438495" "ENST00000444128"
## [9] "ENST00000392477"
names(ncoa7_cdss)
## [1] "ENST00000368357" "ENST00000453302" "ENST00000417494" "ENST00000229634"
## [5] "ENST00000428318" "ENST00000419660" "ENST00000438495" "ENST00000444128"
## [9] "ENST00000392477"

Now, to plot these nine transcripts, we simply use the plotTranscripts function:

plotTranscripts(ncoa7_exons, ncoa7_cdss, ncoa7_metadata, rescale_introns = FALSE)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the wiggleplotr package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the wiggleplotr package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `filter_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `filter()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the wiggleplotr package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `group_by()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the wiggleplotr package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

You might have noticed that since NCOA7 gene has relatively long introns, it can be quite hard to see where all of the exons are. To focus on the exons, we can rescale all introns to a fixed length (50 bp by default):

plotTranscripts(ncoa7_exons, ncoa7_cdss, ncoa7_metadata, rescale_introns = TRUE)

It is now much easier to see, which of the exons can be alternatively spliced and which are shared by all transcripts.

If you are constructing your own transcript annotations, you only need to specify the exons GRanges list for the code to work. In this case, the names of the list will be used as transcript labels on the plot.

plotTranscripts(ncoa7_exons, rescale_introns = TRUE)

##Visualising RNA-seq read coverage We used the NCOA7 example above, because we discovered recently that this gene undergoes alternative promoter usage in human macrophages in response to lipopolysaccharide (LPS) stimulation 1. We’ll now show how the plotCoverage function can be used to visualise RNA-seq read coverage accross the exons of a gene. In addition the the exons, cdss and transcript_annotations paramteres required by plotTranscripts, plotCoverage also requires a track_data data frame containing RNA-seq sample metadata as well as path to the read coverage data in BigWig format.

First, you need to create a data frame containing sample metadata. In our case we have four samples, two from the naive condition and two after LPS stimulation:

sample_data = dplyr::data_frame(
  sample_id = c("aipt_A", "aipt_C", "bima_A", "bima_C"), 
  condition = factor(c("Naive", "LPS", "Naive", "LPS"), levels = c("Naive", "LPS")), 
  scaling_factor = 1)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
sample_data = sample_data %>%
  dplyr::mutate(bigWig = system.file("extdata", paste0(sample_id, ".str2.bw"), 
                                     package = "wiggleplotr"))
as.data.frame(sample_data)
##   sample_id condition scaling_factor
## 1    aipt_A     Naive              1
## 2    aipt_C       LPS              1
## 3    bima_A     Naive              1
## 4    bima_C       LPS              1
##                                                                 bigWig
## 1 /tmp/Rtmp0Wi37B/Rinst1d9067958fed/wiggleplotr/extdata/aipt_A.str2.bw
## 2 /tmp/Rtmp0Wi37B/Rinst1d9067958fed/wiggleplotr/extdata/aipt_C.str2.bw
## 3 /tmp/Rtmp0Wi37B/Rinst1d9067958fed/wiggleplotr/extdata/bima_A.str2.bw
## 4 /tmp/Rtmp0Wi37B/Rinst1d9067958fed/wiggleplotr/extdata/bima_C.str2.bw

Finally, we need to add the track_id and colour_group columns that define which sample belongs to which track and what their colour should be. For simplicity, we first set both of these values equal to the the experimental condition:

track_data = dplyr::mutate(sample_data, track_id = condition, colour_group = condition)

Now, we can make our first read coverage plot

selected_transcripts = c("ENST00000438495", "ENST00000392477") #Plot only two transcripts of the gens
plotCoverage(ncoa7_exons[selected_transcripts], ncoa7_cdss[selected_transcripts], 
             ncoa7_metadata, track_data,
             heights = c(2,1), fill_palette = getGenotypePalette())
## Warning: `mutate_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `mutate()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the wiggleplotr package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `summarise_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `summarise()` instead.
## ℹ The deprecated feature was likely used in the wiggleplotr package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_align()`).

By default, plotCoverage plots the mean read coverage across all of the samples in the same colour group. However, it is also possible to overlay all of the individual samples by setting mean_only = FALSE and alpha < 1.

plotCoverage(ncoa7_exons[selected_transcripts], ncoa7_cdss[selected_transcripts], 
             ncoa7_metadata, track_data,
             heights = c(2,1), fill_palette = getGenotypePalette(), mean_only = FALSE, alpha = 0.5)

It is clear from both plots that the short transcript skipping the first 11 exons of the gene is only expressed after LPS stimulation.

Overlaying multiple conditions

Finally, we can overlay the two conditions in different colours by assigning all of the samples to a single track. This approach can we very useful for visualising eQTLs and splice QTLs. Setting coverage_type = "line" allows us to see both signals even if one overlaps the other:

track_data = dplyr::mutate(sample_data, track_id = "RNA-seq", colour_group = condition)
plotCoverage(ncoa7_exons[selected_transcripts], ncoa7_cdss[selected_transcripts], 
            ncoa7_metadata, track_data,
             heights = c(2,1), fill_palette = getGenotypePalette(), coverage_type = "line")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).

Unfortunately, it is currently not possible to automatically add legends to the read coverage plots. This is because plotTranscripts uses the cowplot::plot_grid function to align the read coverage and transcript annotations plots and plot_grid does not support legends.

Plotting other types of data

Although wiggleplotr was initially written with RNA-seq data in mind, it can be used equally well to visualise any other sequencing data that can be summarised as read coverage in BigWig format (ATAC-seq, DNAse-seq, ChIP-seq). All you need to do is specify your own exons, cdss, transcript_annotations and track_data parameters. Furthermore, setting connect_exons = FALSE and transcript_label = FALSE makes it easier to plot other types of genomic annotations.

track_data = dplyr::mutate(sample_data, track_id = "RNA-seq", colour_group = condition)
plotCoverage(ncoa7_exons[selected_transcripts], ncoa7_cdss[selected_transcripts], 
            ncoa7_metadata, track_data,
             heights = c(2,1), fill_palette = getGenotypePalette(), coverage_type = "line",
             connect_exons = FALSE, transcript_label = FALSE, rescale_introns = FALSE)

Extract transcript annotations automatically from Ensembl and UCSC annotations objects

In addition specifying your own transcript annotations, wiggleplotr also provides four additional wrapper functions that can extract transcript annotations directly from ensembldb and TxDb (UCSC) objects. For ensembldb, you can use the plotTranscriptsFromEnsembldb and plotCoverageFromEnsembldb functions:

library("ensembldb")
library("EnsDb.Hsapiens.v86")
plotTranscriptsFromEnsembldb(EnsDb.Hsapiens.v86, gene_names = "NCOA7", 
                             transcript_ids = c("ENST00000438495", "ENST00000392477"))

For UCSC transcript annotations in TxDb objects, you can use the corresponding plotTranscriptsFromUCSC and plotCoverageFromUCSC functions:

#Load OrgDb and TxDb objects with UCSC gene annotations
require("org.Hs.eg.db")
require("TxDb.Hsapiens.UCSC.hg38.knownGene")
plotTranscriptsFromUCSC(orgdb = org.Hs.eg.db, txdb = TxDb.Hsapiens.UCSC.hg38.knownGene, 
                        gene_names = "NCOA7", transcript_ids = c("ENST00000438495.6", "ENST00000368357.7"))
## 'select()' returned 1:many mapping between keys and columns

Downloading transcript annotations from Ensembl

The easiest way to access reference transcript annotations in R is to download them directly from Ensembl using the biomaRt R package.

Downloading transcript metadata

First, we want to download transcript metadata, such as which transcripts belong to which genes and what are their names. We can use the biomaRt package to do that. First, let’s define which mart and dataset we want to use.

ensembl_mart = useMart("ENSEMBL_MART_ENSEMBL", host = "jan2020.archive.ensembl.org")
## Warning: Ensembl will soon enforce the use of https.
## Ensure the 'host' argument includes "https://"
ensembl_dataset = useDataset("hsapiens_gene_ensembl",mart=ensembl_mart)
ensembl_dataset
## Object of class 'Mart':
##   Using the ENSEMBL_MART_ENSEMBL BioMart database
##   Using the hsapiens_gene_ensembl dataset

The host helps to make sure that we get the annotations from a specific Ensembl version. For example, Ensembl 78 correseponds to host="dec2014.archive.ensembl.org". You can use the Ensembl Archives website to check which host name corresponds to desired Ensembl version. More information using specific ensembl versions with biomaRt can be found in the biomaRt vignette.

We can see all available attributes with the listAttributes command.

attributes = listAttributes(ensembl_dataset)
head(attributes)
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

Now, let’s select gene id, gene name, transcript id and strand from the biomart and download the corresponding columns.

selected_attributes = c("ensembl_transcript_id", "ensembl_gene_id", 
                        "external_gene_name", "strand", 
                        "gene_biotype", "transcript_biotype")
data = getBM(attributes = selected_attributes, mart = ensembl_dataset)
head(data)
##   ensembl_transcript_id ensembl_gene_id external_gene_name strand
## 1       ENST00000387314 ENSG00000210049              MT-TF      1
## 2       ENST00000389680 ENSG00000211459            MT-RNR1      1
## 3       ENST00000387342 ENSG00000210077              MT-TV      1
## 4       ENST00000387347 ENSG00000210082            MT-RNR2      1
## 5       ENST00000386347 ENSG00000209082             MT-TL1      1
## 6       ENST00000361390 ENSG00000198888             MT-ND1      1
##     gene_biotype transcript_biotype
## 1        Mt_tRNA            Mt_tRNA
## 2        Mt_rRNA            Mt_rRNA
## 3        Mt_tRNA            Mt_tRNA
## 4        Mt_rRNA            Mt_rRNA
## 5        Mt_tRNA            Mt_tRNA
## 6 protein_coding     protein_coding

Finally, we need to rename the columns

data = dplyr::rename(data, 
                     transcript_id = ensembl_transcript_id, 
                     gene_id = ensembl_gene_id, 
                     gene_name = external_gene_name)
head(data)
##     transcript_id         gene_id gene_name strand   gene_biotype
## 1 ENST00000387314 ENSG00000210049     MT-TF      1        Mt_tRNA
## 2 ENST00000389680 ENSG00000211459   MT-RNR1      1        Mt_rRNA
## 3 ENST00000387342 ENSG00000210077     MT-TV      1        Mt_tRNA
## 4 ENST00000387347 ENSG00000210082   MT-RNR2      1        Mt_rRNA
## 5 ENST00000386347 ENSG00000209082    MT-TL1      1        Mt_tRNA
## 6 ENST00000361390 ENSG00000198888    MT-ND1      1 protein_coding
##   transcript_biotype
## 1            Mt_tRNA
## 2            Mt_rRNA
## 3            Mt_tRNA
## 4            Mt_rRNA
## 5            Mt_tRNA
## 6     protein_coding

We can now save the metadata into a file to avoid downloading it every time we need to use it.

temporary_file = tempfile(pattern = "file", tmpdir = tempdir(), fileext = ".rds")
saveRDS(data, temporary_file)

Next time we need to access the metadata, we can load it directly from disk.

transcript_metadata = readRDS(temporary_file)
head(transcript_metadata)
##     transcript_id         gene_id gene_name strand   gene_biotype
## 1 ENST00000387314 ENSG00000210049     MT-TF      1        Mt_tRNA
## 2 ENST00000389680 ENSG00000211459   MT-RNR1      1        Mt_rRNA
## 3 ENST00000387342 ENSG00000210077     MT-TV      1        Mt_tRNA
## 4 ENST00000387347 ENSG00000210082   MT-RNR2      1        Mt_rRNA
## 5 ENST00000386347 ENSG00000209082    MT-TL1      1        Mt_tRNA
## 6 ENST00000361390 ENSG00000198888    MT-ND1      1 protein_coding
##   transcript_biotype
## 1            Mt_tRNA
## 2            Mt_rRNA
## 3            Mt_tRNA
## 4            Mt_rRNA
## 5            Mt_tRNA
## 6     protein_coding

Downloading the full transcript database from Ensembl

However, just the transcript metadata is not enought to use wiggleplotr, we also need the coordinates for all exons. We can get those using the GenomicFeatures packages. First, we use the makeTxDbFromBiomart function to download the full transcript database corresponding to a sepcifc Ensembl version, in this case Ensembl 78. Please note that as the database is quite large, this can take at least a couple of minutes to run.

txdb = makeTxDbFromBiomart(biomart = "ENSEMBL_MART_ENSEMBL", 
                           dataset = "hsapiens_gene_ensembl", 
                           host="jan2020.archive.ensembl.org")

We can save the database to disk to avoid downloading it again every time we want to use it.

txdb_file = tempfile(pattern = "file", tmpdir = tempdir(), fileext = ".rds")
saveDb(txdb, txdb_file)

And we can load it from disk using the loadDb function.

txdb = loadDb(txdb_file)

We can extract exon and coding sequence (CDS) coordinates for all annotated transcripts from the database. The following commands will produce a a list of GRanges objects, each element containing the exons or coding sequences of a single transcript.

exons = exonsBy(txdb, by = "tx", use.names = TRUE)
cdss = cdsBy(txdb, by = "tx", use.names = TRUE)

Finally, we use the newly downloaded annotations to visualise the structure of all protein coding transcripts of NCOA7.

selected_transcripts = transcript_metadata %>%
  dplyr::filter(gene_name == "NCOA7", transcript_biotype == "protein_coding")
tx_ids = selected_transcripts$transcript_id
plotTranscripts(exons[tx_ids], cdss[tx_ids], 
                transcript_metadata, rescale_introns = TRUE)