This package requires local installations of Primer3 and BLASTn. TAPseq was developed and tested using Primer3 v.2.5.0 and blastn v.2.6.0. It’s strongly suggested to use Primer3 >= 2.5.0! Earlier versions require a primer3_config directory, which needs to be specified whenever calling functions interacting with Primer3. Source code and installation instructions can be found under:
Primer3: https://github.com/primer3-org/primer3/releases
BLASTn: https://www.ncbi.nlm.nih.gov/books/NBK279690
Please install these tools first and add them to your PATH. If you don’t want to add the tools to your “global” PATH, you can add the following code to your .Rprofile file. This should add the tools to your PATH in R whenever you start a new session.
Sys.setenv(PATH = paste(Sys.getenv("PATH"), "/full/path/to/primer3-x.x.x/src", sep = ":"))
Sys.setenv(PATH = paste(Sys.getenv("PATH"), "/full/path/to/blast+/ncbi-blast-x.x.x+/bin",
sep = ":"))
Alternatively you can specify the paths to 3rd party software as arguments when calling TAPseq functions (TAPseqInput(), designPrimers(), checkPrimers()).
Let’s start by creating transcript sequences for primer design for a target gene panel. First we try to infer likely polyadenylation (polyA) sites for the selected target genes.
library(TAPseq)
library(GenomicRanges)
library(BiocParallel)
# gene annotations for chromosome 11 genomic region
data("chr11_genes")
# convert to GRangesList containing annotated exons per gene. for the sake of time we only use
# a small subset of the chr11 genes. use the full set for a more realistic example
target_genes <- split(chr11_genes, f = chr11_genes$gene_name)
target_genes <- target_genes[18:27]
# K562 Drop-seq read data (this is just a small example file within the R package)
dropseq_bam <- system.file("extdata", "chr11_k562_dropseq.bam", package = "TAPseq")
# register backend for parallelization
register(MulticoreParam(workers = 2))
# infer polyA sites from Drop-seq data
polyA_sites <- inferPolyASites(target_genes, bam = dropseq_bam, polyA_downstream = 50,
wdsize = 100, min_cvrg = 1, parallel = TRUE)
Here we assume all inferred polyA sites to be correct. In a real-world application it’s adviced to manually inspect the polyA site predictions and remove any obvious false positives. This is easiest done by exporting the polyA sites into a .bed. This file can then be loaded into a genome browser (e.g. IGV) together with the target gene annotations and the .bam file to visually inspect the polyA site predictions.
library(rtracklayer)
export(polyA_sites, con = "path/to/polyA_sites.bed", format = "bed")
export(unlist(target_genes), con = "path/to/target_genes.gtf", format = "gtf")
Once we are happy with our polyA predictions, we then use them to truncate the target gene transcripts. All target gene transcripts overlapping any polyA sites are truncated at the polyA sites, as we assume that they mark the transcripts 3’ ends. By default the most downstream polyA site is taken to truncate a transcript. If multiple transcripts per gene overlap polyA sites, a merged truncated transcript model is generated from all overlapping transcripts. If no transcript of a given gene overlap with any polyA sites a consensus transcript model is created from all transcripts for that gene.
# truncate transcripts at inferred polyA sites
truncated_txs <- truncateTxsPolyA(target_genes, polyA_sites = polyA_sites, parallel = TRUE)
To finalize this part, we need to extract the sequences of the truncated transcripts. Here we use Bioconductor’s BSgenome to export transcript sequences, but the genome sequence could also be imported from a fasta file into a DNAStringSet object.
Primer3 uses boulder-IO records for input and output (see: http://primer3.org/manual.html). TAPseq implements TsIO and TsIOList objects, which store Primer3 input and output for TAP-seq primer design in R’s S4 class system. They serve as the users interface to Primer3 during primer design.
First we create Primer3 input for outer forward primer design based
on the obtained transcript sequences. The reverse primer used in all PCR
reactions is used, so that Primer3 can pick forward primers suitable to
work with it. By default this function uses the 10x Beads-oligo-dT and
right primer, and chooses optimal, minimum and maximum melting
temperatures based on that. If another protocol (e.g. Drop-seq) is used,
these parameters can be adapted (?TAPseqInput
).
# create TAPseq IO for outer forward primers from truncated transcript sequences
outer_primers <- TAPseqInput(txs_seqs, target_annot = truncated_txs,
product_size_range = c(350, 500), primer_num_return = 5)
We can now use Primer3 to design Primers and add them to the objects.
The output TsIO objects contain the designed primers and expected amplicons:
# get primers and pcr products for specific genes
tapseq_primers(outer_primers$HBE1)
pcr_products(outer_primers$HBE1)
# these can also be accessed for all genes
tapseq_primers(outer_primers)
pcr_products(outer_primers)
We have now successfully designed 5 outer primers per target gene. To select the best primer per gene we could just pick the primer with the lowest penalty score. However, we will now use BLAST to try to estimate potential off-target priming for every primer. We will use this to then select the best primer per gene, i.e. the primer with the fewest off-targets.
Creating a BLAST database containing all annotated transcripts and chromosome sequences takes a couple of minutes (and ~2Gb of free storage). For this example, we can generate a limited BLAST database, containing only transcripts of all target genes and the sequence of chromosome 11.
# chromosome 11 sequence
chr11_genome <- DNAStringSet(getSeq(hg38, "chr11"))
names(chr11_genome) <- "chr11"
# create blast database
blastdb <- file.path(tempdir(), "blastdb")
createBLASTDb(genome = chr11_genome, annot = unlist(target_genes), blastdb = blastdb)
In a real-world application one would generate a database containing the entire genome and transcriptome. The database can be saved in a location and used for multiple primer designs whithout having to rebuild it everytime.
library(BSgenome)
# human genome BSgenome object (needs to be istalled from Bioconductor)
hg38 <- getBSgenome("BSgenome.Hsapiens.UCSC.hg38")
# download and import gencode hg38 annotations
url <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz"
annot <- import(url, format = "gtf")
# extract exon annotations for protein-coding genes to build transcripts
tx_exons <- annot[annot$type == "exon" & annot$gene_type == "protein_coding"]
# create blast database
blastdb <- file.path(tempdir(), "blastdb")
createBLASTDb(genome = hg38, annot = tx_exons, blastdb = blastdb)
Once we have our BLAST database, we can use it to estimate the number of exonic, intronic and intergenic off-targets for all primers. Note, that the number of exonic and intronic off-targets is provided on the level of genes, i.e. one exonic off-target means that a given primer might bind in exonic region(s) of one other gene. If the off-target binding site overlaps with exons of two genes, it will be counted twice, as it could bind to transcripts of two genes.
# now we can blast our outer primers against the create database
outer_primers <- blastPrimers(outer_primers, blastdb = blastdb, max_mismatch = 0,
min_aligned = 0.75)
# the primers now contain the number of estimated off-targets
tapseq_primers(outer_primers$IFITM1)
To finalize our set of outer primers, we want to choose the best primer per target gene, i.e. the one with the fewest exonic, intronic and intergenic off-target hits (in that order).
To design nested inner primers we simply need to repeat the same procedure with a smaller product size range.
# create new TsIO objects for inner primers, note the different product size
inner_primers <- TAPseqInput(txs_seqs, target_annot = truncated_txs,
product_size_range = c(150, 300), primer_num_return = 5)
# design inner primers
inner_primers <- designPrimers(inner_primers)
# blast inner primers
inner_primers <- blastPrimers(inner_primers, blastdb = blastdb, max_mismatch = 0,
min_aligned = 0.75)
# pick best primer per target gene
best_inner_primers <- pickPrimers(inner_primers, n = 1, by = "off_targets")
Done! We succesfully designed TAP-seq outer and inner primer sets for our target gene panel.
As an additional step, we can verify if the designed primer sets are compatible for PCR multiplexing. For that we use Primer3’s “check_primers” functionality:
# use checkPrimers to run Primer3's "check_primers" functionality for every possible primer pair
outer_comp <- checkPrimers(best_outer_primers)
inner_comp <- checkPrimers(best_inner_primers)
We can now for instance plot the estimated complementarity scores for every pair. We highlight pairs with a score higher than 47, which is considered “critical” by Primer3 during primer design (see Primer3 for more information).
library(dplyr)
library(ggplot2)
# merge outer and inner complementarity scores into one data.frame
comp <- bind_rows(outer = outer_comp, inner = inner_comp, .id = "set")
# add variable for pairs with any complemetarity score higher than 47
comp <- comp %>%
mutate(high_compl = if_else(primer_pair_compl_any_th > 47 | primer_pair_compl_end_th > 47,
true = "high", false = "ok")) %>%
mutate(high_compl = factor(high_compl, levels = c("ok", "high")))
# plot complementarity scores
ggplot(comp, aes(x = primer_pair_compl_any_th, y = primer_pair_compl_end_th)) +
facet_wrap(~set, ncol = 2) +
geom_point(aes(color = high_compl), alpha = 0.25) +
scale_color_manual(values = c("black", "red"), drop = FALSE) +
geom_hline(aes(yintercept = 47), colour = "darkgray", linetype = "dashed") +
geom_vline(aes(xintercept = 47), colour = "darkgray", linetype = "dashed") +
labs(title = "Complementarity scores TAP-seq primer combinations",
color = "Complementarity") +
theme_bw()
No primer pairs with complementarity scores above 47 were found, so they should be ok to use in multiplex PCRs.
To finish the workflow, we can export the designed primers to data.frames, which can be written to .csv files for easy storage of primer sets.
# create data.frames for outer and inner primers
outer_primers_df <- primerDataFrame(best_outer_primers)
inner_primers_df <- primerDataFrame(best_inner_primers)
# the resulting data.frames contain all relevant primer data
head(outer_primers_df)
Furthermore we can create BED tracks with the genomic coordinates of the primer binding sites for viewing in a genome browser.
# create BED tracks for outer and inner primers with custom colors
outer_primers_track <- createPrimerTrack(best_outer_primers, color = "steelblue3")
inner_primers_track <- createPrimerTrack(best_inner_primers, color = "goldenrod1")
# the output data.frames contain lines in BED format for every primer
head(outer_primers_track)
head(inner_primers_track)
# export tracks to .bed files ("" writes to the standard output, replace by a file name)
exportPrimerTrack(outer_primers_track, con = "")
exportPrimerTrack(inner_primers_track, con = "")
Alternatively, primer tracks can be easily created for outer and inner primers and written to one .bed file, so that both primer sets can be viewed in the same track, but with different colors.
All of the output in this vignette was produced under the following conditions:
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 dplyr_1.1.4
#> [3] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.75.0
#> [5] rtracklayer_1.67.0 BiocIO_1.17.1
#> [7] Biostrings_2.75.3 XVector_0.47.0
#> [9] BiocParallel_1.41.0 GenomicRanges_1.59.1
#> [11] GenomeInfoDb_1.43.2 IRanges_2.41.2
#> [13] S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [15] generics_0.1.3 TAPseq_1.19.0
#> [17] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2
#> [3] blob_1.2.4 bitops_1.0-9
#> [5] fastmap_1.2.0 RCurl_1.98-1.16
#> [7] GenomicAlignments_1.43.0 XML_3.99-0.17
#> [9] digest_0.6.37 lifecycle_1.0.4
#> [11] KEGGREST_1.47.0 RSQLite_2.3.9
#> [13] magrittr_2.0.3 compiler_4.4.2
#> [15] rlang_1.1.4 sass_0.4.9
#> [17] tools_4.4.2 yaml_2.3.10
#> [19] knitr_1.49 labeling_0.4.3
#> [21] S4Arrays_1.7.1 bit_4.5.0.1
#> [23] curl_6.0.1 DelayedArray_0.33.3
#> [25] abind_1.4-8 withr_3.0.2
#> [27] purrr_1.0.2 sys_3.4.3
#> [29] grid_4.4.2 colorspace_2.1-1
#> [31] scales_1.3.0 SummarizedExperiment_1.37.0
#> [33] cli_3.6.3 rmarkdown_2.29
#> [35] crayon_1.5.3 httr_1.4.7
#> [37] rjson_0.2.23 DBI_1.2.3
#> [39] cachem_1.1.0 zlibbioc_1.52.0
#> [41] parallel_4.4.2 AnnotationDbi_1.69.0
#> [43] BiocManager_1.30.25 restfulr_0.0.15
#> [45] matrixStats_1.4.1 vctrs_0.6.5
#> [47] Matrix_1.7-1 jsonlite_1.8.9
#> [49] bit64_4.5.2 GenomicFeatures_1.59.1
#> [51] maketools_1.3.1 jquerylib_0.1.4
#> [53] tidyr_1.3.1 glue_1.8.0
#> [55] codetools_0.2-20 gtable_0.3.6
#> [57] UCSC.utils_1.3.0 munsell_0.5.1
#> [59] tibble_3.2.1 pillar_1.10.0
#> [61] htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
#> [63] R6_2.5.1 evaluate_1.0.1
#> [65] lattice_0.22-6 Biobase_2.67.0
#> [67] png_0.1-8 Rsamtools_2.23.1
#> [69] memoise_2.0.1 bslib_0.8.0
#> [71] SparseArray_1.7.2 xfun_0.49
#> [73] MatrixGenerics_1.19.0 buildtools_1.0.0
#> [75] pkgconfig_2.0.3